LOCAL AND GLOBAL CONVERGENCE OF AN INERTIAL
VERSION OF FORWARD-BACKWARD SPLITTING *

Abstract. A problem of great interest in optimization is to minimize a sum of two closed,
proper, and convex functions where one is smooth and the other has a computationally inexpensive
proximal operator. In this paper we analyze a family of Inertial Forward-Backward Splitting (I-FBS)
algorithms for solving this problem. We first apply a global Lyapunov analysis to I-FBS and prove
weak convergence of the iterates to a minimizer in a real Hilbert space. We then show that the
algorithms achieve local linear convergence for “sparse optimization”, which is the important special
case where the nonsmooth term is the $\ell_1$-norm. This result holds under either a restricted strong
convexity or a strict complementarity condition, and we do not require the objective to be strictly
convex. For certain parameter choices we determine an upper bound on the number of iterations
until the iterates are confined to a manifold containing the solution set and linear convergence holds.

The local linear convergence result for sparse optimization holds for the Fast Iterative Shrinkage 
and Soft Thresholding Algorithm (FISTA) due to Beck and Teboulle which is a particular parameter 
choice for I-FBS. In spite of its optimal global objective function convergence rate, we show that 
FISTA is not optimal for sparse optimization with respect to the local convergence rate. We determine 
the locally optimal parameter choice for the I-FBS family. Finally we propose a method which inherits 
the excellent global rate of FISTA but also has excellent local rate. 

Key words. proximal gradient methods, forward-backward splitting, inertial methods,
$\ell_1$-regularization, local linear convergence

AMS subject classifications. 65K05, 65K15, 90C06, 90C25 


1. Introduction. We are concerned with the following important problem:

$$\operatorname*{minimize}_{x \in \mathcal{H}} \; F(x) = f(x) + g(x), \qquad (1.1)$$

where $\mathcal{H}$ is a Hilbert space over the real numbers, the functions
$f : \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$ and $g : \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$
are proper, convex and closed, and in addition $f$ is Gateaux differentiable and has a Lipschitz
continuous gradient. Problems of this form have received considerable attention in recent years in
applications such as machine learning, compressed sensing and image processing, among many
other examples. Of particular interest in this paper will be the special case which we
will call sparse optimization (SO).

(Problem SO)

$$\operatorname{minimize} \; F(x) = f(x) + \rho\|x\|_1,$$

where $\rho > 0$ and $\|x\|_1 = \sum_i |x_i|$. We refer to this problem as “sparse optimization”
because the $\ell_1$-norm encourages sparse solutions. When $f(x) = \frac{1}{2}\|b - Ax\|^2$ with
$A \in \mathbb{R}^{m \times n}$, Problem SO is often referred to as sparse least squares
(Problem $\ell_1$-LS), basis pursuit denoising or LASSO. This problem is of central importance
in compressed sensing and also has applications in machine learning [7] and image
processing [8]. Other important instances of Problem (1.1) include least squares with
a total-variation or nuclear-norm regularizer, and minimization constrained to
a closed and convex set.


*The proofs of Thms. 4.1, 5.1, 5.2, and 5.6 of this manuscript contain several errors. These errors 
have been fixed in a revised and rewritten manuscript entitled “Local and Global Convergence of 
a General Inertial Proximal Splitting Scheme” arxiv id. 1602.02726. We recommend reading this 
updated manuscript. 

†Beckman Institute, University of Illinois, 405 N. Mathews Ave., Urbana, IL, 61801, USA (contact:
prjohns2@illinois.edu)

1.1. Background. In this paper we focus on first-order splitting methods for
solving Problem (1.1). These methods use evaluations of $F$, gradients of the smooth
part $f$, and evaluations of the proximal operator of the nonsmooth part $g$. In particular
we focus on the forward-backward splitting algorithm (FBS), which is a classical
first-order splitting approach to solving Problem (1.1) [12]. In fact FBS was developed
for the more general monotone inclusion problem, which includes Problem (1.1) as a
special case. FBS involves a “forward” step, which is an explicit gradient step with
respect to the differentiable component $f$, and a “backward” step, which is an implicit,
proximal step with respect to $g$. For many popular instances of $g$ this proximal step
is computationally inexpensive. The convergence rate of the objective function to
the infimum is $O(1/k)$, which is better than the $O(1/\sqrt{k})$ rate achieved by the
“black-box” subgradient method, and is the same as if the possibly nonsmooth component
were not present. Weak convergence of the iterates is also guaranteed, and linear
convergence occurs on strongly convex problems. FBS is also commonly referred
to as the proximal gradient method, and for the special case of Problem SO it is
known as the iterative shrinkage and soft thresholding algorithm (ISTA), owing to the
form of the proximal step w.r.t. the $\ell_1$-norm. Other first-order splitting
methods include ADMM, linearized and preconditioned ADMM, primal-dual
methods, Bregman iterations and generalized FBS. These methods can
deal with more complicated situations, such as when $g$ is composed with a bounded
linear operator or when the sum of $m > 1$ proximable¹ functions is present.

Nesterov developed several methods for minimizing a convex function with Lipschitz
gradient ([24], [25] chapter 2). These methods obtain the best objective function
convergence rate possible by any first-order method. Specifically, they guarantee a
convergence rate of $O(1/k^2)$ for the objective function, which is optimal in the worst-case
sense for convex functions with Lipschitz gradient. Note that this improves on the
$O(1/k)$ rate achieved by classical gradient descent.

Beck and Teboulle extended Nesterov’s method to Problem (1.1),
allowing for the presence of the possibly nonsmooth function $g$. Their method, FISTA,
combines Nesterov’s inertial update into an FBS framework using the same sequence
of “momentum” parameters. FISTA corresponds to a particular parameter choice
for the following suite of algorithms, which we will call Inertial Forward-Backward
Splitting (I-FBS).

(I-FBS): $\forall k \in \mathbb{N}$,

$$y^{k+1} = x^k + \alpha_k (x^k - x^{k-1}),$$
$$x^{k+1} = \operatorname{prox}_{\lambda_k g}\left(y^{k+1} - \lambda_k \nabla f(y^{k+1})\right),$$

with $x^0, x^1 \in \mathcal{H}$ chosen arbitrarily (typically $x^1 = x^0$). The sequences
$\{\alpha_k\}_{k \in \mathbb{N}}$ and $\{\lambda_k\}_{k \in \mathbb{N}}$ are subsets of
$\mathbb{R}_{\geq 0}$. The proximal operator $\operatorname{prox}_g : \mathcal{H} \to \mathcal{H}$ will be
properly defined in Section 2.2. Beck and Teboulle showed that for a specific choice
of $\{\alpha_k\}_{k \in \mathbb{N}}$ and $\{\lambda_k\}_{k \in \mathbb{N}}$, I-FBS obtains the optimal
$O(1/k^2)$ rate in terms of the objective function; however, they did not prove convergence of the
iterates to a minimizer, which is also unknown for Nesterov’s method. Tseng [26] showed that other
choices also achieve the $O(1/k^2)$ rate. Recently in [27], Chambolle and Dossal considered
a very similar choice of the parameters to Beck and Teboulle which obtains the $O(1/k^2)$
rate in the objective function and also weak convergence of the iterates to a minimizer.
Throughout the rest of the paper we will refer to all these parameter choices for I-FBS
that obtain the $O(1/k^2)$ objective function rate as FISTA-like choices. Note that FBS
corresponds to I-FBS with $\alpha_k$ set to 0 for all $k \in \mathbb{N}$ and $\lambda_k$ in the
range $(0, 2/L)$, where $L$ is the Lipschitz constant of $\nabla f$. Nesterov’s method corresponds
to I-FBS with the same parameter choice as FISTA and with $g = 0$.

¹Possessing a simple proximal operator.
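
As an illustration only, the I-FBS recursion can be sketched in a few lines of Python. The names `ifbs` and `soft_threshold`, the list-based vectors, the toy problem ($f(x) = \frac{1}{2}\|x - b\|^2$, so that $L = 1$, with $g = \rho\|\cdot\|_1$) and the constant parameters $\alpha_k = 0.3$, $\lambda_k = 0.5$ are assumptions made for this sketch and are not taken from the paper.

```python
def soft_threshold(v, tau):
    """Element-wise soft-thresholding, i.e. the proximal operator of tau * ||.||_1."""
    return [max(abs(vi) - tau, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def ifbs(grad_f, prox_g, x0, alphas, lams):
    """Run I-FBS for len(alphas) iterations (hypothetical helper).

    grad_f(y)      -> gradient of the smooth part f at y
    prox_g(z, lam) -> proximal operator of lam * g evaluated at z
    """
    x_prev = list(x0)
    x = list(x0)  # typical initialization x^1 = x^0
    for alpha, lam in zip(alphas, lams):
        # Inertial extrapolation: y^{k+1} = x^k + alpha_k (x^k - x^{k-1}).
        y = [xi + alpha * (xi - pi) for xi, pi in zip(x, x_prev)]
        # Forward (explicit gradient) step at y^{k+1}.
        grad = grad_f(y)
        forward = [yi - lam * gi for yi, gi in zip(y, grad)]
        # Backward (proximal) step gives x^{k+1}.
        x_prev, x = x, prox_g(forward, lam)
    return x


# Toy instance (assumed for illustration): f(x) = 0.5 * ||x - b||^2, g = rho * ||x||_1.
b = [3.0, -0.5, 1.2]
rho = 1.0
grad_f = lambda x: [xi - bi for xi, bi in zip(x, b)]
prox_g = lambda z, lam: soft_threshold(z, lam * rho)

num_iters = 200
x_hat = ifbs(grad_f, prox_g, [0.0, 0.0, 0.0], [0.3] * num_iters, [0.5] * num_iters)
```

For this toy problem the minimizer is the soft-thresholded vector $S_\rho(b)$, and the iterates approach it. Passing $\alpha_k = 0$ for all $k$ reduces the same loop to FBS.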

One of the aims of this paper is to establish broad conditions for the convergence
of the iterates of I-FBS to a minimizer of Problem (1.1). A generalization of the
I-FBS family has been studied previously in [28] in the setting of monotone operator
inclusion problems. However our global analysis proves convergence for a wider range
of parameter choices than was proved there. An algorithm similar to I-FBS was
developed in [29] for the more general problem of finding a fixed point of a nonexpansive
operator. However the conditions for convergence are far more strict than those
developed in this paper. To the best of our knowledge the conditions for weak convergence
of the iterates of I-FBS developed in this paper are novel in the literature. A more
detailed comparison with existing literature is given in Section 3.

It has been observed that for the special case of Problem SO, FBS exhibits local
linear convergence (see e.g. [30], [31]), elsewhere called eventual linear
convergence [33]. By this it is meant that there exists some $N > 0$ such that for all $k > N$
the iterates are confined to a manifold containing the solution set and convergence
to a solution is linear. It is not known whether I-FBS (including the FISTA-like
choices) obtains local linear convergence for Problem SO; however, recently [33] has
made progress for the special case of Problem $\ell_1$-LS. In this paper, we address this by
establishing local linear convergence of I-FBS for Problem SO for a broad range of
parameter choices including the FISTA-like choices. Of course local linear convergence
of the sequence implies convergence of the entire sequence.


1.2. Contributions of this Paper. In the first part of the paper, we analyze
I-FBS with an appropriate multi-step Lyapunov function. This approach allows us
to develop novel conditions on the algorithmic parameters that imply convergence
of the iterates to a minimizer (weak convergence in a real Hilbert space, ordinary
convergence in $\mathbb{R}^n$). This widens the range of possible parameter choices beyond
those proposed in prior art such as [35].

In the second part of the paper, we consider in detail the behavior of I-FBS applied
to Problem SO. We show that after a finite number of iterations I-FBS reduces to
minimizing a local function on a reduced support subject to an orthant constraint.
This result holds for the FISTA-like choices along with a wide range of other parameter
choices. Next we show that a simple “locally optimal” parameter choice for I-FBS
obtains a local linear convergence rate with the best asymptotic iteration complexity.
The asymptotically optimal iteration complexity is better than that obtained by the
FISTA-like choices and by ISTA. The improvement gained by I-FBS over ISTA when
the correct amount of momentum is added is equivalent to the improvement that
Nesterov’s accelerated method [33] achieves over gradient descent for strongly convex
functions with Lipschitz gradients. As a corollary of our analysis, we show that the
adaptive momentum restart scheme proposed in [34] achieves the optimal iteration
complexity. In contrast, the analysis in [34] is only valid for strongly convex quadratic
functions. Finally, for parameter choices for which the “momentum parameter” $\alpha_k$
is bounded away from 1, we determine an explicit upper bound on the number of
iterations until convergence to the optimal manifold.

With little effort our analysis of I-FBS for Problem SO can be adapted to apply
to the splitting inertial proximal method (SIPM) proposed by Moudafi and Oliny [35] . 
This method is a direct generalization of the heavy ball with friction method (HBF) 





PATRICK JOHNSTONE AND PIERRE MOULIN 


[36] to proximal splitting problems and differs from I-FBS in that the gradient of $f$
is computed at $x^k$ rather than $y^{k+1}$. We show that SIPM also achieves local linear
convergence for this problem under appropriate parameter constraints.

The paper is organized as follows. In Section 2 notation and assumptions are
discussed. In Section 3 we precisely define the I-FBS family and discuss known
convergence results in more detail. In Section 4 we apply our Lyapunov analysis to
I-FBS. In Section 5 we derive convergence results for Problem SO. Finally, we present
numerical experiments.

2. Preliminaries. 

2.1. Notation and Definitions. Throughout the paper, $\mathcal{H}$ is a Hilbert space
over the field of real numbers, $\langle \cdot, \cdot \rangle$ is the inner product and $\|\cdot\|$
is the associated norm. Let $\Gamma_0(\mathcal{H})$ be the set of all closed, convex and proper
functions whose domain is a subset of $\mathcal{H}$ and range is a subset of
$\mathbb{R} \cup \{+\infty\}$. For any $g : \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$ and
point $x \in \mathcal{H}$, we denote by $\partial_\epsilon g(x)$ for $\epsilon \geq 0$ the
$\epsilon$-enlargement of the subdifferential, defined as the set

$$\partial_\epsilon g(x) = \{v \in \mathcal{H} : g(y) \geq g(x) + \langle v, y - x \rangle - \epsilon, \ \forall y \in \mathcal{H}\}, \qquad (2.1)$$

which is always convex and closed and may be empty. We will use $\partial g$ to denote $\partial_0 g$.
When $\partial g(x)$ is a singleton we will call it the gradient at $x$, denoted by $\nabla g(x)$.

For $a : \mathbb{R} \to \mathbb{R}$ and $b : \mathbb{R} \to \mathbb{R}$, the notation
$a(k) = O(b(k))$ (resp. $a(k) = \Omega(b(k))$) means there exists a constant $C > 0$ such that
$\lim_{k \to \infty} a(k)/b(k) \leq C$ (resp. $\lim_{k \to \infty} a(k)/b(k) \geq C$). The notation
$a(k) = o(b(k))$ means $\lim_{k \to \infty} a(k)/b(k) = 0$.

We will say a sequence $\{x^k\}_{k \in \mathbb{N}} \subset \mathcal{H}$ converges linearly to
$x^* \in \mathcal{H}$ with rate of convergence $q \in (0, 1)$ if $\|x^k - x^*\| = O(q^k)$. To be
precise we will occasionally refer to this as asymptotic or local linear convergence. Note that this
is different from nonasymptotic, or global, linear convergence with rate $q$, in which case there
exists a $C > 0$ such that $\|x^k - x^*\| \leq C q^k$ for all $k \in \mathbb{N}$. In contrast, local
linear convergence allows for a finite number of iterations where such a relationship does not hold.

Define the optimal value of Problem (1.1) as

$$F^* = \inf_{x \in \mathcal{H}} F(x)$$

and the solution set as

$$X^* = \{x \in \mathcal{H} : F(x) = F^*\}.$$

Given a function $a : \mathbb{R} \to \mathbb{R}$, we say that the iteration complexity of a method for
minimizing $F$ is $\Omega(a(\epsilon))$ if $k = \Omega(a(\epsilon))$ implies $F(x^k) - F^* = O(\epsilon)$.
To be precise we will occasionally refer to this as the asymptotic iteration complexity.

For a matrix $A \in \mathbb{R}^{m \times n}$ and a set $S \subseteq \{1, 2, \ldots, n\}$, $A_S$ will
denote the matrix in $\mathbb{R}^{m \times |S|}$ formed by taking the columns corresponding to the
elements of $S$. For a vector $v \in \mathbb{R}^n$, $v_S$ will denote the $|S| \times 1$ vector with
entries given by the entries of $v$ on the indices corresponding to the elements of $S$, and
$(v_S, 0)$ will denote the vector in $\mathbb{R}^n$ equal to $v$ on the indices corresponding to $S$
and equal to zero everywhere else. Given $c \in \mathbb{R}$ and $x \in \mathbb{R}^n$, $\operatorname{sgn}(c)$
is defined as $+1$ if $c > 0$ and $-1$ if $c < 0$, and $\operatorname{sgn}(x)$ is simply
$\operatorname{sgn}(\cdot)$ applied element-wise. We will use the notation $[c]_+ = \max(c, 0)$.

2.2. Proximal Operators. The proximal operator $\operatorname{prox}_g : \mathcal{H} \to \mathcal{H}$
w.r.t. a function $g \in \Gamma_0(\mathcal{H})$ is defined implicitly by

$$y - \operatorname{prox}_g(y) \in \partial g(\operatorname{prox}_g(y)),$$

and explicitly by

$$\operatorname{prox}_g(y) = \operatorname*{argmin}_{x \in \mathcal{H}} \left\{ \frac{1}{2}\|x - y\|^2 + g(x) \right\}. \qquad (2.2)$$

Since the function being minimized in (2.2) is strongly convex and in $\Gamma_0(\mathcal{H})$,
$\operatorname{prox}_g(y)$ exists and is unique for every $y \in \mathcal{H}$; thus it is a well-defined
mapping with domain equal to $\mathcal{H}$. To be more general we will actually use the
$\epsilon$-enlarged proximal operator, which is the set

$$\operatorname{prox}^{\epsilon}_g(y) = \{v : y - v \in \partial_\epsilon g(v)\},$$

which is not necessarily a singleton (except when $\epsilon = 0$). Note that
$\operatorname{prox}_g(y) \in \operatorname{prox}^{\epsilon}_g(y)$ for all $\epsilon \geq 0$. The use of
$\operatorname{prox}^{\epsilon}_g$ allows for some approximation error in the computation of the
proximal operator.

2.3. Cocoercivity and Convexity. We say that a Gateaux differentiable and
convex function $f$ has a $\frac{1}{L}$-cocoercive gradient with $L > 0$ if

$$\langle \nabla f(y) - \nabla f(x), y - x \rangle \geq \frac{1}{L}\|\nabla f(y) - \nabla f(x)\|^2, \quad \forall x, y \in \mathcal{H}. \qquad (2.3)$$

Note this is equivalent to the gradient being $L$-Lipschitz continuous, i.e.

$$\|\nabla f(y) - \nabla f(x)\| \leq L\|y - x\|, \quad \forall x, y \in \mathcal{H}. \qquad (2.4)$$

For a proof see Lemma 1.4 and the Baillon-Haddad Theorem [35]. We will need
the following two standard properties of such a function. For all $u, v \in \mathcal{H}$:

$$f(u) - f(v) \leq \langle \nabla f(v), u - v \rangle + \frac{L}{2}\|u - v\|^2, \qquad (2.5)$$

and (by convexity)

$$f(u) - f(v) \leq \langle \nabla f(u), u - v \rangle. \qquad (2.6)$$

We are now ready to formally state our Assumptions for Problem (1.1).

Assumption 1. $f$ and $g$ are in $\Gamma_0(\mathcal{H})$, $f$ is Gateaux differentiable everywhere
and has a $1/L$-cocoercive gradient with $L > 0$, and $F^* > -\infty$.


2.4. Properties of Sparse Optimization. We now outline our assumptions
for Problem SO and discuss some of its properties.

Assumption SO. $f \in \Gamma_0(\mathcal{H})$, is twice differentiable everywhere, and has a
$1/L$-cocoercive gradient with $L > 0$. $F^* > -\infty$ and $X^*$ is non-empty.

The main difference between Assumption SO and Assumption 1 is that we additionally
assume that $f$ is twice differentiable. Let $H(x)$ denote the Hessian of $f$ at $x$.
Then the Lipschitz constant $L$ of the gradient is equal to the supremum of the largest
eigenvalue of $H(x)$ over all $x$. Furthermore note that $\|\cdot\|_1$ is in $\Gamma_0(\mathcal{H})$.
Finally note that for $\rho > 0$ the function $f(x) + \rho\|x\|_1$ is coercive, thus $X^*$ is non-empty.

Problem SO includes Problem $\ell_1$-LS, defined as

(Problem $\ell_1$-LS) $$\operatorname{minimize} \; F(x) = \frac{1}{2}\|b - Ax\|^2 + \rho\|x\|_1,$$

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. The solution set $X^*$ of Problem
$\ell_1$-LS is always non-empty. The function $f$ has gradient equal to $A^\top(Ax - b)$, which is
Lipschitz-continuous with Lipschitz constant $L$ equal to the largest eigenvalue of $A^\top A$.
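
To make these two quantities concrete, the following Python sketch evaluates the gradient $A^\top(Ax - b)$ and estimates $L$ as the largest eigenvalue of $A^\top A$. The helper names, the list-of-rows matrix storage and the use of power iteration are assumptions of this sketch; the paper does not prescribe how $L$ is computed, and power iteration can fail if the starting vector is orthogonal to the leading eigenvector.

```python
import math


def matvec(A, x):
    """Return A x for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Return A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def grad_least_squares(A, b, x):
    """Gradient A^T (A x - b) of f(x) = 0.5 * ||b - A x||^2."""
    residual = [ax_i - b_i for ax_i, b_i in zip(matvec(A, x), b)]
    return rmatvec(A, residual)


def lipschitz_constant(A, num_iters=200):
    """Estimate the largest eigenvalue of A^T A by power iteration."""
    v = [1.0] * len(A[0])
    estimate = 0.0
    for _ in range(num_iters):
        w = rmatvec(A, matvec(A, v))
        norm_w = math.sqrt(sum(w_j * w_j for w_j in w))
        if norm_w == 0.0:
            return 0.0
        # From the second pass on v has unit norm, so ||A^T A v|| is the eigenvalue estimate.
        estimate = norm_w
        v = [w_j / norm_w for w_j in w]
    return estimate


A = [[2.0, 0.0], [0.0, 1.0]]
L_est = lipschitz_constant(A)
g = grad_least_squares(A, [0.0, 0.0], [1.0, 1.0])
```

A step-size $\lambda_k \leq 1/L$ computed from such an estimate should be treated with some care, since underestimating $L$ makes the step too long.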

The proximal operator associated with $\rho\|\cdot\|_1$ is the shrinkage and soft-thresholding
operator $S_\rho : \mathbb{R} \to \mathbb{R}$, applied element-wise. It is defined as
$S_\rho(v) = [|v| - \rho]_+ \operatorname{sgn}(v)$, and thus

$$\{\operatorname{prox}_{\rho\|\cdot\|_1}(z)\}_i = S_\rho(z_i), \quad i = 1, 2, \ldots, n. \qquad (2.7)$$

In the analysis of I-FBS applied to Problem SO we will need the following result
proved in [17].

Theorem 2.1 (Theorem 2.1 [17]). For Problem SO suppose Assumption SO
holds; then there exists a vector $h^* \in \mathbb{R}^n$ such that for all $x^* \in X^*$,
$\nabla f(x^*) = h^*$. Furthermore, for all $i \in \{1, 2, \ldots, n\}$,

$$\frac{h^*_i}{\rho} \begin{cases} = -1 & \text{if } \exists x \in X^* : x_i > 0, \\ = +1 & \text{if } \exists x \in X^* : x_i < 0, \\ \in [-1, 1] & \text{else.} \end{cases}$$

The following two sets, also used in [17], will be crucial to our analysis. Let
$D = \{i : |h^*_i| < \rho\}$ and $E = \{i : |h^*_i| = \rho\}$. Note that $D \cap E = \emptyset$ and
$D \cup E = \{1, 2, \ldots, n\}$. By Theorem 2.1 we can infer that $\operatorname{supp}(x^*) \subseteq E$
for all $x^* \in X^*$. Finally, define

$$\omega = \min\{\rho - |h^*_i| : i \in D\} > 0.$$
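
The sets $D$ and $E$ and the margin $\omega$ can be read off directly once $h^*$ is available. The sketch below is illustrative only: the function name `partition_indices`, the 0-based indices, the floating-point tolerance `tol` and the convention $\omega = +\infty$ when $D$ is empty are assumptions of this sketch, not part of the paper.

```python
def partition_indices(h_star, rho, tol=1e-10):
    """Split indices into D = {i : |h*_i| < rho} and E = {i : |h*_i| = rho}.

    Equality is tested up to `tol` because h* is typically known only numerically.
    Returns (D, E, omega) with omega = min over D of rho - |h*_i|.
    """
    D = [i for i, h in enumerate(h_star) if abs(h) < rho - tol]
    E = [i for i, h in enumerate(h_star) if abs(h) >= rho - tol]
    # Convention for this sketch: omega is +infinity when D is empty.
    omega = min((rho - abs(h_star[i]) for i in D), default=float("inf"))
    return D, E, omega


# Hypothetical gradient vector at a solution, with rho = 1.
D, E, omega = partition_indices([1.0, -0.3, -1.0, 0.5], rho=1.0)
```

Any index carrying a non-zero entry of a solution must appear in `E`, in line with $\operatorname{supp}(x^*) \subseteq E$.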


We will need the following Lemma proved in [17].

Lemma 2.2 (Lemma 4.1 [17]). Under Assumption SO, if $\lambda \in [0, 2/L)$,

$$\|x - \lambda \nabla f(x) - (y - \lambda \nabla f(y))\| \leq \|x - y\|, \quad \forall x, y \in \mathbb{R}^n.$$

An alternative definition of cocoercivity is to say that if an operator
$T : \mathcal{H} \to \mathcal{H}$ is $\gamma$-cocoercive then $\gamma T$ is firmly nonexpansive.
Thus Lemma 2.2 is just an elementary property of firmly nonexpansive operators (see
Proposition 4.2 (iii) and Proposition 4.33 [31]).


Finally, the following properties of $S_\nu$ will be useful.

Lemma 2.3 (Lemma 3.2 [17]). Fix any $a$ and $b$ in $\mathbb{R}$, and $\nu > 0$.
• The function $S_\nu$ is nonexpansive. That is,

$$|S_\nu(a) - S_\nu(b)| \leq |a - b|.$$

• If $|b| \geq \nu$ and $\operatorname{sgn}(a) \neq \operatorname{sgn}(b)$ then

$$|S_\nu(a) - S_\nu(b)| \leq |a - b| - \nu.$$

• If $S_\nu(a) \neq 0 = S_\nu(b)$ then $|a| > \nu$, $|b| \leq \nu$ and

$$|S_\nu(a) - S_\nu(b)| \leq |a - b| - (\nu - |b|). \qquad (2.8)$$
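
The three properties in Lemma 2.3 are easy to spot-check numerically. The following Python sketch does so on a finite grid; it is a sanity check rather than a proof. The names `soft_threshold`, `sgn` and `check_lemma_2_3`, the grid, the tolerance and the convention $\operatorname{sgn}(0) = 0$ are assumptions of this sketch.

```python
def soft_threshold(v, nu):
    """Scalar operator S_nu(v) = [|v| - nu]_+ sgn(v)."""
    magnitude = max(abs(v) - nu, 0.0)
    return magnitude if v >= 0 else -magnitude


def sgn(c):
    """Sign of c, with sgn(0) = 0 (a convention assumed for this sketch)."""
    return (c > 0) - (c < 0)


def check_lemma_2_3(nu, grid, tol=1e-12):
    """Return True if the three inequalities hold for every pair (a, b) on the grid."""
    for a in grid:
        for b in grid:
            s_a, s_b = soft_threshold(a, nu), soft_threshold(b, nu)
            gap = abs(s_a - s_b)
            # Property 1: nonexpansiveness.
            if gap > abs(a - b) + tol:
                return False
            # Property 2: |b| >= nu and different signs.
            if abs(b) >= nu and sgn(a) != sgn(b) and gap > abs(a - b) - nu + tol:
                return False
            # Property 3: S_nu(a) != 0 = S_nu(b).
            if s_a != 0 and s_b == 0:
                if not (abs(a) > nu and abs(b) <= nu):
                    return False
                if gap > abs(a - b) - (nu - abs(b)) + tol:
                    return False
    return True


grid = [i / 4.0 for i in range(-12, 13)]  # points -3.0, -2.75, ..., 3.0
holds = check_lemma_2_3(1.0, grid)
```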
3. I-FBS. To be more general, our global analysis will apply to the following
I-FBS family.

(I-FBS-$\epsilon$): $\forall k \in \mathbb{N}$,

$$y^{k+1} = x^k + \alpha_k (x^k - x^{k-1}),$$
$$x^{k+1} \in \operatorname{prox}^{\epsilon_k}_{\lambda_k g}\left(y^{k+1} - \lambda_k \nabla f(y^{k+1})\right),$$

with $x^0, x^1 \in \mathcal{H}$ chosen arbitrarily. Note that for any $\epsilon_k \geq 0$,

$$\operatorname{prox}_{\lambda_k g}\left(y^{k+1} - \lambda_k \nabla f(y^{k+1})\right) \in \operatorname{prox}^{\epsilon_k}_{\lambda_k g}\left(y^{k+1} - \lambda_k \nabla f(y^{k+1})\right).$$

We will refer to $\{\alpha_k\}_{k \in \mathbb{N}}$ as the “momentum” parameters and
$\{\lambda_k\}_{k \in \mathbb{N}}$ as the “step-size” parameters. The algorithm differs from I-FBS in
that it uses the $\epsilon$-enlarged subdifferential, allowing for some error in the computation of
the proximal operator.

3.1. Known Convergence Results. Beck and Teboulle proposed the following
choice of parameters for I-FBS (I-FBS-$\epsilon$ with the $\epsilon_k$ set to 0 for all
$k \in \mathbb{N}$):

$$\forall k \in \mathbb{N}, \quad \lambda_k = \frac{1}{L}, \quad \alpha_k = \frac{t_k - 1}{t_{k+1}}, \quad \text{where } t_{k+1} = \frac{1 + \sqrt{4 t_k^2 + 1}}{2}, \ t_1 = 1. \qquad (3.1)$$

The method is known as FISTA. With this choice of parameters, Beck and Teboulle
showed that the objective function converges to the minimum at the worst-case optimal
rate of $O(1/k^2)$. In fact the $O(1/k^2)$ rate holds for a variety of choices of
$\{\alpha_k\}_{k \in \mathbb{N}}$ which all have the form $\alpha_k = 1 - O(1/k)$ [26]. However the
choice in (3.1) guarantees the largest possible decrease in a given upper bound of $F(x^k)$ at each
iteration. Chambolle and Dossal [27] considered I-FBS with a similar choice of
$\{\alpha_k\}_{k \in \mathbb{N}}$ to what was proposed by Beck and Teboulle. They investigated, for
some $a > 2$,

$$\forall k \in \mathbb{N}, \quad 0 < \lambda_k \leq \frac{1}{L}, \quad \alpha_k = \frac{t_k - 1}{t_{k+1}}, \quad \text{where } t_{k+1} = \frac{k + a}{a}, \ t_1 = 1. \qquad (3.2)$$

With this choice of parameters, the authors showed that the objective function achieves
the optimal $O(1/k^2)$ convergence rate and in addition that the sequence of iterates weakly
converges to a minimizer.
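
For concreteness, the two FISTA-like momentum sequences (3.1) and (3.2) can be generated as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the closed form $\alpha_k = (k - 1)/(k + a)$ used for (3.2) is obtained here by substituting $t_k = (k + a - 1)/a$ into $\alpha_k = (t_k - 1)/t_{k+1}$.

```python
import math


def fista_momentum(num_iters):
    """Momentum sequence of (3.1): alpha_k = (t_k - 1) / t_{k+1}, with t_1 = 1."""
    alphas = []
    t = 1.0
    for _ in range(num_iters):
        t_next = (1.0 + math.sqrt(4.0 * t * t + 1.0)) / 2.0
        alphas.append((t - 1.0) / t_next)
        t = t_next
    return alphas


def chambolle_dossal_momentum(num_iters, a):
    """Momentum sequence of (3.2): t_k = (k + a - 1) / a gives alpha_k = (k - 1) / (k + a)."""
    if a <= 2:
        raise ValueError("the choice (3.2) requires a > 2")
    return [(k - 1.0) / (k + a) for k in range(1, num_iters + 1)]


alphas_fista = fista_momentum(1000)
alphas_cd = chambolle_dossal_momentum(1000, a=3.0)
```

Both sequences start at 0 and increase toward 1, which is the behavior that the local analysis of Problem SO identifies as undesirable.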

In contrast to the works above, our analysis establishes weak convergence of the
iterates for a wide range of parameter choices. Indeed, the momentum sequence is
not constrained to follow a particular recursion, but instead need only
satisfy $\alpha_k \in [0, 1]$ and $\limsup \alpha_k < 1$. However we do not guarantee the
$O(1/k^2)$ objective function rate.

Lorenz and Pock [28] generalized I-FBS to the problem of finding a zero of the
sum of two maximal monotone operators $A$ and $B$, one of which is cocoercive. Setting
$A = \nabla f$ and $B = \partial g$ recovers Problem (1.1). They also replaced the scalar step-size
$\lambda_k$ with a general positive definite operator $\lambda_k M^{-1}$. Lorenz and Pock proved weak
convergence of the iterates to a solution provided certain restrictions on
$\{\alpha_k\}_{k \in \mathbb{N}}$ and $\{\lambda_k\}_{k \in \mathbb{N}}$ hold. The restrictions on
$\alpha_k$ are stronger than those derived in our global analysis.
In their analysis, if the step-size $\lambda_k$ is fixed to $1/L$, $\alpha_k$ is restricted to be less
than $\sqrt{5} - 2 \approx 0.24$, whereas, as we shall see in Section 4, our Lyapunov analysis allows
$\alpha_k \in [0, 1]$, so long as $\limsup \alpha_k < 1$. For the step-size, their conditions are less
restrictive than ours, allowing for values of $\lambda_k$ up to $2/L$, whereas our analysis only
allows up to $1/L$. However in their analysis larger values of $\lambda_k$ lead to a smaller range
of feasible values for $\alpha_k$, reducing to 0 as $\lambda_k$ approaches $2/L$.

In [29], an inertial version of the classical Krasnosel’skii-Mann (KM) algorithm
was analyzed. The KM algorithm finds the fixed points of a nonexpansive operator
$T$. Setting the operator $T(x) = \operatorname{prox}_{\lambda_k g}(x - \lambda_k \nabla f(x))$ in the
inertial KM method of [29] recovers I-FBS, since a point $x$ is a fixed point of $T$ if and only if
it is a solution of Problem (1.1). The analysis of [29] proves weak convergence of the iterates
to a fixed point but relies on verifying conditions of the form
$\sum_k \alpha_k \|x^k - x^{k-1}\|^p < \infty$ with $p$ equal to 1 and 2. In general this condition must
be enforced online, restricting the range of possible choices for the sequence of momentum
parameters. However, it was shown in [40] that choosing $\{\alpha_k\}_{k \in \mathbb{N}}$ to be
nondecreasing and satisfying $\{\alpha_k\}_{k \in \mathbb{N}} \subset [0, \bar{\alpha})$ with
$\bar{\alpha} < 1/3$ suffices to ensure the condition is satisfied and thus prove weak convergence.
This condition is more restrictive than the ones derived in this paper for the special
case of Problem (1.1).


3.2. Known Convergence Results for Sparse Optimization. The FISTA-like
sequences for $\{\alpha_k\}_{k \in \mathbb{N}}$ defined in (3.1) and (3.2) both converge to 1. As we
will see in Section 5, this is not desirable for Problem SO. In the language of dynamical
systems, when the momentum is too high the iterates move into an “underdamped
regime” leading to oscillations in the objective function and slow convergence (see [31]
for an analysis in the strongly-convex quadratic case). We will show that for Problem
SO the FISTA-like choices are not optimal from the viewpoint of asymptotic rate
of convergence under a local strong-convexity assumption (Corollary 5.4) or a strict
complementarity condition (Corollary 5.5).

In [33] the behavior of ISTA and FISTA (i.e. I-FBS with parameter choice (3.1))
applied to Problem $\ell_1$-LS was investigated through a spectral analysis. The authors
show that both algorithms obtain local linear convergence for this problem, under
the condition that the minimizer is unique, but without an estimate for the number
of iterations until convergence to the optimal manifold. Furthermore they determine
that the local rate of convergence of FISTA is worse than ISTA, while the transient
behavior of FISTA is better than ISTA. Therefore they suggest switching from FISTA
to ISTA once the optimal manifold has been identified. Our contribution differs in
several ways. We note that the poor local performance of the FISTA-like choices is
due to having the momentum parameter converge to 1. Therefore we determine the
optimal value for the momentum parameter that should be used in the asymptotic
regime, which allows for a better asymptotic rate than both ISTA and FISTA, and
suggest a heuristic method for estimating the optimal momentum. We also show that
the adaptive restart method of [34] will achieve the $O(1/k^2)$ rate in the transient
regime and the optimal asymptotic rate. Furthermore our analysis holds for Problem
SO, with Problem $\ell_1$-LS as a special case, and we do not require the minimizer to be
unique. Finally, in the case where $\limsup \alpha_k < 1$, we provide explicit upper bounds
on the number of iterations until I-FBS has converged to the optimal manifold.

A method has also been developed for solving Problem (1.1) when $f$ is strongly
convex. The method is equivalent to I-FBS with the same prescription for
$\{\alpha_k\}_{k \in \mathbb{N}}$ as determined by Nesterov for his method for minimizing strongly
convex functions (constant scheme 2.2.8 of [25]). However it also includes a backtracking procedure
for adjusting $\{\lambda_k\}_{k \in \mathbb{N}}$ and $\{\alpha_k\}_{k \in \mathbb{N}}$ when the strong
convexity and Lipschitz gradient parameters are not known. The authors also extended their method
to Problem $\ell_1$-LS, including the case where $f$ is not strongly convex. The authors showed
that under conditions on the matrix $A$ related to the Restricted Isometry Property
(RIP) used in compressed sensing, their algorithm obtains nonasymptotic (global)
linear convergence, so long as the initial vector is sufficiently sparse. However, as the
authors note, the RIP-like conditions are much stronger than those typically found
in the literature. Indeed the conditions are much stronger than those required in
our proof of local linear convergence. We establish that I-FBS obtains local linear
convergence regardless of the initialization point. Furthermore no RIP-like assumptions
are necessary. Local linear convergence can be proved under the mild condition
that the smallest eigenvalue of the Hessian restricted to the support of a minimum
is non-zero at the minimum point or, if this does not hold, under a common
strict-complementarity condition (see Section 5.4). That being said, it should be noted that
local linear convergence is not as strong a statement as global linear convergence.


4. A Global Analysis of I-FBS. This section derives conditions on
$\{\epsilon_k\}_{k \in \mathbb{N}}$, $\{\alpha_k\}_{k \in \mathbb{N}}$ and $\{\lambda_k\}_{k \in \mathbb{N}}$
which imply weak convergence of the iterates of I-FBS to a minimizer of Problem (1.1).
Throughout the rest of the paper, let $\Delta_k$ denote $x^k - x^{k-1}$. Given sets $S, T$ define
$d(S, T) = \min_{s \in S, t \in T} \|s - t\|$.


Theorem 4.1. Suppose that Assumption 1 holds. Assume $\{\lambda_k\}_{k \in \mathbb{N}}$ is
non-decreasing and satisfies $0 < \lambda_k \leq 1/L$ for all $k$, and
$\{\epsilon_k\}_{k \in \mathbb{N}}$ satisfies $\epsilon_k \geq 0$ for all $k$ and
$\sum_k \epsilon_k < \infty$. If $0 \leq \alpha_k \leq 1$ for all $k$ and $\limsup \alpha_k < 1$, then
for the iterates of I-FBS-$\epsilon$, we have

(i) $\sum_{k=1}^{\infty} \|\Delta_k\|^2 < \infty$.

(ii) $\lim_{k \to \infty} d(0, \nabla f(x^k) + \partial_{\epsilon_k} g(x^k)) = \lim_{k \to \infty} d(0, \nabla f(y^k) + \partial_{\epsilon_k} g(y^k)) = 0$.

(iii) If, in addition, $X^*$ is non-empty, then $\{x^k\}_{k \in \mathbb{N}}$ converges weakly to some
$\bar{x} \in X^*$.

Proof. The proof consists of two parts. In the first, we prove statements (i) and
(ii) using arguments inspired by Alvarez’ analysis of the inertial proximal method in
[42]. In the second part, we invoke Opial’s Lemma [43] to prove statement (iii). The
second part is inspired by the analysis of the splitting inertial proximal algorithm by
Moudafi and Oliny in [35].


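Proof of statements (i) and (ii). Define the Lyapunov function, or discrete
energy, to be

$$E_k = \frac{\alpha_k}{2\lambda_k}\|\Delta_k\|^2 + f(x^k) + g(x^k).$$

Note that this is the same energy function used by Alvarez [42]. Inequalities (2.5),
(2.6) and (2.1) imply

$$\begin{aligned} E_{k+1} - E_k &= \frac{\alpha_{k+1}}{2\lambda_{k+1}}\|\Delta_{k+1}\|^2 - \frac{\alpha_k}{2\lambda_k}\|\Delta_k\|^2 + f(x^{k+1}) - f(x^k) + g(x^{k+1}) - g(x^k) \\ &\leq \frac{\alpha_{k+1}}{2\lambda_{k+1}}\|\Delta_{k+1}\|^2 - \frac{\alpha_k}{2\lambda_k}\|\Delta_k\|^2 + \langle \nabla f(y^{k+1}) + v, \Delta_{k+1} \rangle \\ &\quad + \frac{L}{2}\|x^{k+1} - y^{k+1}\|^2 + \epsilon_k, \quad \forall v \in \partial_{\epsilon_k} g(x^{k+1}). \end{aligned} \qquad (4.1)$$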


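Note that the existence of a subgradient $v$ is guaranteed because the $\epsilon$-enlarged
proximal operator has domain equal to $\mathcal{H}$. Using (4.1), the $x^{k+1}$-update in I-FBS and
the fact that $\lambda_k \leq \lambda_{k+1}$, we infer that

$$\begin{aligned} E_{k+1} - E_k &\leq \frac{\alpha_{k+1}}{2\lambda_{k+1}}\|\Delta_{k+1}\|^2 - \frac{\alpha_k}{2\lambda_k}\|\Delta_k\|^2 - \frac{1}{\lambda_k}\langle \Delta_{k+1} - \alpha_k \Delta_k, \Delta_{k+1} \rangle + \frac{L}{2}\|\Delta_{k+1} - \alpha_k \Delta_k\|^2 + \epsilon_k \\ &\leq \frac{\alpha_{k+1}}{2\lambda_k}\|\Delta_{k+1}\|^2 - \frac{\alpha_k}{2\lambda_k}\|\Delta_k\|^2 - \frac{1}{\lambda_k}\langle \Delta_{k+1} - \alpha_k \Delta_k, \Delta_{k+1} \rangle \\ &\quad + \frac{L}{2}\left(\|\Delta_{k+1}\|^2 + \alpha_k^2\|\Delta_k\|^2\right) - \alpha_k L \langle \Delta_{k+1}, \Delta_k \rangle + \epsilon_k \\ &= -\left(\frac{1}{\lambda_k} - \frac{L}{2} - \frac{\alpha_{k+1}}{2\lambda_k}\right)\|\Delta_{k+1}\|^2 - \frac{\alpha_k(1 - \alpha_k \lambda_k L)}{2\lambda_k}\|\Delta_k\|^2 + \frac{\alpha_k(1 - \lambda_k L)}{\lambda_k}\langle \Delta_{k+1}, \Delta_k \rangle + \epsilon_k \\ &= -\frac{\alpha_k(1 - \lambda_k L)}{2\lambda_k}\|\Delta_{k+1} - \Delta_k\|^2 - \frac{2 + \lambda_k L(\alpha_k - 1) - \alpha_k - \alpha_{k+1}}{2\lambda_k}\|\Delta_{k+1}\|^2 \\ &\quad - \frac{L \alpha_k (1 - \alpha_k)}{2}\|\Delta_k\|^2 + \epsilon_k. \end{aligned}$$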

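Moving terms to the other side and summing implies, for all $N \in \mathbb{Z}_+$,

$$\begin{aligned} \sum_{k=1}^{N} &\left[ \frac{\alpha_k(1 - \lambda_k L)}{2\lambda_k}\|\Delta_{k+1} - \Delta_k\|^2 + \frac{2 + \lambda_k L(\alpha_k - 1) - \alpha_k - \alpha_{k+1}}{2\lambda_k}\|\Delta_{k+1}\|^2 + \frac{L \alpha_k (1 - \alpha_k)}{2}\|\Delta_k\|^2 \right] \\ &\leq E_1 - E_{N+1} + \sum_{k=1}^{N} \epsilon_k \\ &\leq E_1 - F^* + \sum_{k=1}^{N} \epsilon_k = F(x^1) - F^* + \frac{\alpha_1}{2\lambda_1}\|\Delta_1\|^2 + \sum_{k=1}^{N} \epsilon_k < \infty. \end{aligned} \qquad (4.2)$$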

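Inequality (4.2) along with the assumptions on $\{\alpha_k\}_{k \in \mathbb{N}}$ and
$\{\lambda_k\}_{k \in \mathbb{N}}$ imply statement (i). Statement (i) implies $\|\Delta_{k+1}\| \to 0$,
therefore $\|x^k - y^{k+1}\| \to 0$ via the $y^{k+1}$-update of I-FBS. This implies that
$\|x^k - y^k\| \to 0$, because $\|x^k - y^k\| \leq \|x^{k-1} - y^k\| + \|x^{k-1} - x^k\|$. Finally, using
the $x^{k+1}$-update of I-FBS we infer that

$$\lim_{k \to \infty} d(\nabla f(y^k) + \partial_{\epsilon_k} g(y^k), 0) = \lim_{k \to \infty} d(\nabla f(x^k) + \partial_{\epsilon_k} g(x^k), 0) = 0. \qquad (4.3)$$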


Proof of statement (iii). If $x^{k_n}$ is a subsequence which weakly converges to
$x'$, then the $y^{k+1}$-update of I-FBS implies $y^{k_n}$ also weakly converges to $x'$. This,
combined with the $x^{k+1}$-update, implies that $x' \in X^*$. Suppose that for any $x^* \in X^*$,
the sequence $\|x^k - x^*\|$ has a limit. This implies the sequence $x^k$ is bounded and
therefore it has at least one weakly convergent subsequence, $x^{k_n}$ (ordinary convergence
in $\mathbb{R}^n$). By the above reasoning the limit of this subsequence, $\bar{x}$, must be in $X^*$.
Furthermore $\lim_k \|x^k - \bar{x}\|$ exists. Consider another subsequence $x^{k_m}$ which converges
weakly to $x' \in X^*$. By considering the fact that
$\lim_n \|x^{k_n} - \bar{x}\|^2 = \lim_m \|x^{k_m} - \bar{x}\|^2$ and the corresponding statement for $x'$,
one can see that $\|\bar{x} - x'\| = 0$. Therefore the set of weak limits of convergent subsequences
is the singleton $\{\bar{x}\}$. Thus $x^k$ weakly converges to $\bar{x} \in X^*$ (this is Opial’s
Lemma [43]).

Assume $X^*$ is non-empty. We now proceed to show that, for any $x^* \in X^*$, the
sequence $\|x^k - x^*\|$ has a limit. Our proof closely follows Moudafi and Oliny’s analysis
[35], and is similar to the later variants. The main difference is we allow for
$\alpha_k$ to be 1 for a finite number of iterations. Fix $x^* \in X^*$ and define
$\varphi_k = \frac{1}{2}\|x^k - x^*\|^2$. Now

$$\varphi_k - \varphi_{k+1} = \frac{1}{2}\|\Delta_{k+1}\|^2 + \langle x^{k+1} - y^{k+1}, x^* - x^{k+1} \rangle + \alpha_k \langle \Delta_k, x^* - x^{k+1} \rangle. \qquad (4.4)$$

Since

$$-x^{k+1} + y^{k+1} - \lambda_k \nabla f(y^{k+1}) \in \lambda_k \partial_{\epsilon_k} g(x^{k+1}),$$

$-\lambda_k \nabla f(x^*) \in \lambda_k \partial_{\epsilon_k} g(x^*)$ and
$\langle \partial_\epsilon g(x^{k+1}) - \partial_\epsilon g(x^*), x^{k+1} - x^* \rangle \geq -\epsilon$, it follows that

$$\langle x^{k+1} - y^{k+1} + \lambda_k(\nabla f(y^{k+1}) - \nabla f(x^*)), x^* - x^{k+1} \rangle \geq -\lambda_k \epsilon_k. \qquad (4.5)$$

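Combining (4.4) and (4.5) we obtain

$$\varphi_k - \varphi_{k+1} \geq \frac{1}{2}\|\Delta_{k+1}\|^2 + \lambda_k \langle \nabla f(y^{k+1}) - \nabla f(x^*), x^{k+1} - x^* \rangle - \alpha_k \langle \Delta_k, x^{k+1} - x^* \rangle - \lambda_k \epsilon_k. \qquad (4.6)$$

Now

$$\begin{aligned} \langle \Delta_k, x^{k+1} - x^* \rangle &= \langle \Delta_k, x^k - x^* \rangle + \langle \Delta_k, \Delta_{k+1} \rangle \\ &= \varphi_k - \varphi_{k-1} + \frac{1}{2}\|\Delta_k\|^2 + \langle \Delta_k, \Delta_{k+1} \rangle. \end{aligned} \qquad (4.7)$$

Combining (4.6) and (4.7) yields

$$\begin{aligned} \varphi_{k+1} - \varphi_k - \alpha_k(\varphi_k - \varphi_{k-1}) &\leq -\frac{1}{2}\|\Delta_{k+1}\|^2 + \alpha_k \langle \Delta_k, \Delta_{k+1} \rangle + \frac{\alpha_k}{2}\|\Delta_k\|^2 \\ &\quad - \lambda_k \langle \nabla f(y^{k+1}) - \nabla f(x^*), x^{k+1} - x^* \rangle + \lambda_k \epsilon_k. \end{aligned} \qquad (4.8)$$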

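Now we use the fact that $\nabla f$ is cocoercive as follows. Inequality (2.3) implies

$$\begin{aligned} \lambda_k \langle \nabla f(y^{k+1}) - \nabla f(x^*), x^{k+1} - x^* \rangle &= \lambda_k \left( \langle \nabla f(y^{k+1}) - \nabla f(x^*), y^{k+1} - x^* \rangle + \langle \nabla f(y^{k+1}) - \nabla f(x^*), x^{k+1} - y^{k+1} \rangle \right) \\ &\geq \lambda_k \left( \frac{1}{L}\|\nabla f(y^{k+1}) - \nabla f(x^*)\|^2 + \langle \nabla f(y^{k+1}) - \nabla f(x^*), x^{k+1} - y^{k+1} \rangle \right) \\ &\geq -\frac{\lambda_k L}{4}\|x^{k+1} - y^{k+1}\|^2, \end{aligned} \qquad (4.9)$$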


where (4.9) follows by completing the square. Combining (4.8) and (4.9) we infer 
‘fik+l — P’k — (Xki'Pk — fk-l) < ~2 Afe+l|P + Oik{Ak, Afe+i) -I- ^||Afc|P 

H—^11 A/c+1 — afeAfelp -|- AfeCfe 


ctfe f XL 


— 1 I ||Afc_|_i — A 


+ ^(4 + AL(a,-l))||A,f 


Ofc 

'T 

XL-2 


(1 ~ Ofe)|| Afe+ilp -I- AfcCfe. 


(4.10) 
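Note that the coefficients of $\|\Delta_{k+1} - \Delta_k\|^2$ and $\|\Delta_{k+1}\|^2$ are non-positive. Set
$\theta_k = \varphi_k - \varphi_{k-1}$ and

$$\delta_k = \tfrac{\alpha_k}{4}\big(4 + \lambda_k L(\alpha_k - 1)\big)\|\Delta_k\|^2 + \lambda_k\epsilon_k \tag{4.11}$$

and note that $\sum_k \delta_k < \infty$.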

The argument from now on is basically identical to [35] except we allow for
sequences $\alpha_k$ which are equal to 1 for a finite number of $k$. Restate (4.10) as

$$\theta_{k+1} \le \alpha_k\theta_k + \delta_k \le \alpha_k[\theta_k]_+ + \delta_k. \tag{4.12}$$

Since $\limsup \alpha_k < 1$, there exists an integer $K > 0$ and $\overline{\alpha} \in [0,1)$ such that $\alpha_k \le \overline{\alpha} < 1$
for all $k > K$. This and (4.12) imply that, for $k > K$,

$$[\theta_{k+1}]_+ \le \overline{\alpha}[\theta_k]_+ + \delta_k.$$

Thus for $k > K$


K 


[6k+i]+ < a 


k-K 


[^k]+ + ^a 

j=K 


k-jc I —k-Ksr^ X 
^Sj+a 

i=i 


Careful examination of this expression yields 


OO K _/ oo \ 

^[^/c+l]-|- < K ^ Sk + _ _ I [S'!]-!- + ^ Sk \ < oo. 


(4.13) 


fc =0 


/c=0 


k=K 


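Set $w_k = \varphi_k - \sum_{i=1}^{k} [\theta_i]_+$. Since $\varphi_k \ge 0$ and $\sum_i [\theta_i]_+ < \infty$, $w_k$ is bounded from below.
Moreover $w_{k+1} = w_k + \theta_{k+1} - [\theta_{k+1}]_+ \le w_k$, so $w_k$ is non-increasing and therefore converges.
Therefore $\varphi_k$ converges for every $x^* \in X^*$. By invoking Opial's Lemma, statement (vi) is established. □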

Theorem 4.1 does not apply to the FISTA-like parameter choices because for all of
these choices $\alpha_k \to 1$. However the theorem does apply if we make the following
modification. Replace the momentum parameter sequence $\{\alpha_k\}_{k\in\mathbb{N}}$ with $\min(\alpha_k, \overline{\alpha})$ where
$\overline{\alpha} < 1$. This parameter choice satisfies the assumptions of the theorem, and $\overline{\alpha}$ can
be chosen arbitrarily close to 1. However the $O(1/k^2)$ objective function convergence
rate is no longer guaranteed once $\alpha_k$ exceeds $\overline{\alpha}$.
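The clipping rule above can be sketched in a few lines. This is only an illustration: it assumes that the FISTA-like sequence is the standard Beck and Teboulle rule $t_1 = 1$, $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$, $\alpha_k = (t_k - 1)/t_{k+1}$, which is our reading of the choice the text refers to as (3.1); the function names are hypothetical.

```python
import math


def beck_teboulle_momentum(num_iters):
    """Momentum values alpha_k = (t_k - 1) / t_{k+1}, starting from t_1 = 1.

    Assumption: this is the standard Beck-Teboulle rule; the text only
    refers to it as parameter choice (3.1).
    """
    t = 1.0
    alphas = []
    for _ in range(num_iters):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        alphas.append((t - 1.0) / t_next)
        t = t_next
    return alphas


def clipped_momentum(num_iters, alpha_bar):
    """Replace alpha_k by min(alpha_k, alpha_bar) with alpha_bar < 1."""
    if not 0.0 <= alpha_bar < 1.0:
        raise ValueError("alpha_bar must lie in [0, 1)")
    return [min(a, alpha_bar) for a in beck_teboulle_momentum(num_iters)]
```

For example, `clipped_momentum(50, 0.9)` follows the unclipped sequence while it stays below 0.9 and is constant at 0.9 afterwards, so the momentum stays bounded away from 1.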


In the following Corollary, we use (4.2) to determine explicit bounds on $\sum_k \|\Delta_k\|^2$
which will be useful in the analysis of Problem SO.

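Corollary 4.2. Suppose that Assumption 1 holds. Assume $\{\lambda_k\}_{k\in\mathbb{N}}$ is
non-decreasing and satisfies $0 < \lambda_k \le 1/L$ for all $k$, $\{\epsilon_k\}_{k\in\mathbb{N}}$ satisfies $\epsilon_k \ge 0$ for all $k$ and
$\sum_k \epsilon_k < \infty$, and there exists $\overline{\alpha} \in [0,1)$ such that $\{\alpha_k\}_{k\in\mathbb{N}}$ satisfies $0 \le \alpha_k \le \overline{\alpha}$ for all $k$.
Then for the iterates of I-FBS,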




F{x^)-F* + 




2L{l-a) - 1 


fc=i / 


If, in addition, there exists $\underline{\alpha} \in [0, \overline{\alpha}]$ such that $\alpha_k \ge \underline{\alpha}$ for all $k$, then


(4.14) 


fc=l —^ \ k=l / 


(4.15) 
Proof. Inequality (4.2) implies


II A ii2 , ^ 4” ^k{^k 1) ttfc tt/c-l-l II ^ ||2 


F{x^)-F* + ^\\A,r+Y,ek>Y.- 


k=l k=l 


2Xk 


ife+i| 




k=l 


2 — 2q — Afc 
2Afc^ ' 


^fe+il 


>E -- - -- o — 


ifc+i| 


(4.16) 

(4.17) 


k=l 


which proves (4.14). To derive (4.16) we used the fact that $0 \le \alpha_k \le \overline{\alpha}$. To derive
(4.17) we used the fact that $\lambda_k L \le 1$.

Inequality (4.2) also implies 


*^1 II A ii2 , Lak{^ Oik) ||^ ||2 


+ E^fc>E' 


/c=l /c=l 

OO r2 


2Xk 


iT. 




k=l 


which proves (4.15). 


5. Convergence Analysis of I-FBS for Sparse Optimization. 


5.1. Finite Convergence Results. We now turn our attention to Problem SO.
The following theorem proves finite convergence to 0 for the components in $D$, and
finite convergence to the correct sign for the components in $E$ (recall the definitions
of $D$ and $E$ in Section 2.4). Following the terminology of [30] we will refer to this as
the "finite manifold identification period". The manifold in this case is the half-space
of vectors with support a subset of $E$ and non-zero components with sign $-h_i^*/\rho$.
This theorem generalizes the result of Theorem 4.5 in [17] from ISTA to I-FBS. For
simplicity, we only consider the case where $\epsilon_k$ is 0 for all $k$, meaning the proximal
operator is computed exactly. Thus the results are stated for I-FBS, not I-FBS-$\epsilon$. Note
that the proximal operator w.r.t. the $\ell_1$-norm is relatively easy to compute as it has
a separable closed form, so we do not think it is worth considering I-FBS-$\epsilon$ in this
case. In the next subsection, we consider the FISTA-like methods.
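The separable closed form mentioned above is the soft-thresholding operator, written $S_\nu$ in the proofs that follow. A minimal NumPy sketch (the function name is ours, not the paper's):

```python
import numpy as np


def soft_threshold(u, nu):
    """Componentwise S_nu(u) = sgn(u_i) * max(|u_i| - nu, 0).

    This is the proximal operator of nu * ||.||_1, evaluated exactly.
    """
    u = np.asarray(u, dtype=float)
    return np.sign(u) * np.maximum(np.abs(u) - nu, 0.0)
```

Each component is handled independently, so the cost is linear in the dimension and no inner iteration is needed.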

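Theorem 5.1. Suppose that Assumption SO holds. Assume $\{\lambda_k\}_{k\in\mathbb{N}}$ is
non-decreasing and satisfies $0 < \lambda_k \le 1/L$, and there exist $\underline{\alpha}, \overline{\alpha} \in [0,1)$ such that $\{\alpha_k\}_{k\in\mathbb{N}}$
satisfies $\underline{\alpha} \le \alpha_k \le \overline{\alpha}$ for all $k$. Then, there exist constants $K_D > 0$ and $K_E > 0$ such
that the iterates of I-FBS applied to Problem SO satisfy, for all $k > K_E$,

$$\mathrm{sgn}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big) = -\frac{h_i^*}{\rho}, \quad \forall i \in E, \tag{5.1}$$

and, for all $k > K_D$,

$$x_i^k = y_i^k = 0, \quad \forall i \in D. \tag{5.2}$$

Furthermore, $K_E \le \bar{K}_E$ and $K_D \le \bar{K}_D$, where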
'2a{l + a) (^Fix^) - F* + 


Ke = 


p^Xl 


a{l — a)L‘^ 


\x^ — x*\\^ — a\\x^ — x*\\'^ 


+ 


1 — a 


(5.3) 


and 


Kd = 


oj'^Xl 


2a{l + a) (^Fix^) - F* + 


a(l — a)L2 


\\x^ — x*\\^ — a\\x^ — x*\\'^ 


+ -^ + 2 
I — a 


(5.4) 


for any $x^* \in X^*$.

Proof. Note that this parameter choice satisfies the requirements of Theorem 4.1
and Corollary 4.2. Furthermore, by assumption, $X^*$ is non-empty and $F^* > -\infty$,
thus all conclusions of Theorem 4.1 and Corollary 4.2 hold. Throughout the proof,
fix an arbitrary $x^* \in X^*$.


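Proof of (5.1). Fix a $\lambda > 0$. Recall from Theorem 2.1 there exists a vector $h^*$
such that $\nabla f(x^*) = h^*$ for all $x^* \in X^*$, and that $\mathrm{supp}(x^*) \subseteq E$. For $i \in \mathrm{supp}(x^*)$,

$$0 \ne x_i^* = \mathrm{sgn}(x_i^* - \lambda h_i^*)\big[|x_i^* - \lambda h_i^*| - \rho\lambda\big]_+. \tag{5.5}$$

Therefore $|x_i^* - \lambda h_i^*| > \rho\lambda$ for all $i \in \mathrm{supp}(x^*)$. On the other hand, if $i \in E\setminus\mathrm{supp}(x^*)$,
then

$$|x_i^* - \lambda h_i^*| = \lambda|h_i^*| = \rho\lambda.$$

Therefore

$$|x_i^* - \lambda h_i^*| \ge \rho\lambda, \quad \forall i \in E.$$

Looking at (5.5) it can be seen that

$$\mathrm{sgn}(x_i^*) = \mathrm{sgn}(x_i^* - \lambda h_i^*), \quad \forall i \in \mathrm{supp}(x^*). \tag{5.6}$$

Note by Theorem 2.1, if $i \in \mathrm{supp}(x^*)$, then $\mathrm{sgn}(x_i^*) = -h_i^*/\rho$. Else if $i \in E\setminus\mathrm{supp}(x^*)$
then

$$\mathrm{sgn}(x_i^* - \lambda h_i^*) = \mathrm{sgn}(-\lambda h_i^*) = -\mathrm{sgn}(h_i^*) = -\frac{h_i^*}{\rho}. \tag{5.7}$$

Combining (5.6) and (5.7) yields

$$\mathrm{sgn}(x_i^* - \lambda h_i^*) = -h_i^*/\rho, \quad \forall i \in E,\ \lambda > 0. \tag{5.8}$$

Let $\nu_k = \rho\lambda_k$. If

$$\mathrm{sgn}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big) \ne \mathrm{sgn}(x_i^* - \lambda_k h_i^*) = -h_i^*/\rho \quad \text{for some } i \in E, \tag{5.9}$$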
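then Lemma 2.3 implies

$$\begin{aligned}
|x_i^{k+1} - x_i^*|^2 &= \big|S_{\nu_k}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big) - S_{\nu_k}(x_i^* - \lambda_k h_i^*)\big|^2 \\
&\le \big(\big|y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i - (x_i^* - \lambda_k h_i^*)\big| - \nu_k\big)^2 \\
&\le \big|y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i - (x_i^* - \lambda_k h_i^*)\big|^2 - \nu_k^2,
\end{aligned} \tag{5.10}$$

where (5.10) follows because

$$\big|y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i - (x_i^* - \lambda_k h_i^*)\big| \ge |x_i^* - \lambda_k h_i^*| \ge \nu_k > 0.$$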


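Using (5.10) we can say the following: Condition (5.9) implies that

$$\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \sum_{j \ne i} |x_j^{k+1} - x_j^*|^2 + |x_i^{k+1} - x_i^*|^2 \\
&\le \sum_{j \ne i} \big|y_j^{k+1} - \lambda_k\nabla f(y^{k+1})_j - (x_j^* - \lambda_k h_j^*)\big|^2 + \big|y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i - (x_i^* - \lambda_k h_i^*)\big|^2 - \nu_k^2 && (5.11) \\
&= \|y^{k+1} - \lambda_k\nabla f(y^{k+1}) - (x^* - \lambda_k h^*)\|^2 - \nu_k^2 \\
&\le \|y^{k+1} - x^*\|^2 - \nu_k^2 && (5.12) \\
&= \|x^k + \alpha_k\Delta_k - x^*\|^2 - \nu_k^2 \\
&\le \|x^k - x^*\|^2 + \alpha_k^2\|\Delta_k\|^2 + 2\alpha_k\langle \Delta_k, x^k - x^*\rangle - \nu_1^2. && (5.13)
\end{aligned}$$

Inequality (5.11) follows from the element-wise nonexpansiveness of $S_\nu$, along with
(5.10). To deduce (5.12), we used Lemma 2.2. Finally, (5.13) follows because $\{\lambda_k\}_{k\in\mathbb{N}}$
is non-decreasing and therefore so is $\{\nu_k\}$.

Recall the definition of $\varphi_k = \frac{1}{2}\|x^k - x^*\|^2$ and $\theta_k = \varphi_k - \varphi_{k-1}$. Now, moving
$\langle \Delta_k, \Delta_{k+1}\rangle$ to the other side of (4.7) reveals

$$\langle \Delta_k, x^k - x^*\rangle = \varphi_k - \varphi_{k-1} + \tfrac{1}{2}\|\Delta_k\|^2. \tag{5.14}$$

Substituting (5.14) into (5.13) yields

$$2(\varphi_{k+1} - \varphi_k) \le 2\alpha_k(\varphi_k - \varphi_{k-1}) + \overline{\alpha}(1 + \overline{\alpha})\|\Delta_k\|^2 - \nu_1^2,$$

therefore

$$\theta_{k+1} \le \alpha_k\theta_k + \frac{\overline{\alpha}(1 + \overline{\alpha})}{2}\|\Delta_k\|^2 - \frac{\nu_1^2}{2}.$$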


Repeating the arguments that led to (4.13), we can say the following: if (5.9) is true 
then 


Ok + l S: Q^Ol + 


a(l 


1=1 ^ 1=1 


2 


(5.15) 
Therefore, for $M \in \mathbb{Z}_+$, if (5.9) holds at iteration $M$, then

M 

‘PM - po = 


fc=l 




< 


< 


1 — a 

9i a{l 


_/- , _\ M o M k — 1 

k—1 k—lj=0 


2(1 


M 


1 -a 2(1-a) 




k=l 


a 


M 


1 — a (1 — ay 


(5.16) 


To derive (5.16) we lower bounded the coefficient of $-\nu_1^2$. Since $\varphi_k \ge 0$, if (5.9) is
true at iteration $k$ then


k < 


2(1-a) 


a(l + a) 
2 ( 1 - 0 ) 




fell^ + 


|a;° — x*\y 


Oi 


fc=i 


1 — a 


1 — cx 


(5.17) 




2 a(l + a) (F(a;^) 


_ JP* 


ailAil 


a{l — a)L‘^ 


la:^ — x*\y — a\\x^ — x*\y 


a 


1 — a 


(5.18) 


To derive (5.18) we used the upper bound on $\sum_k \|\Delta_k\|^2$ in (4.15) from Corollary 4.2.
This upper bound is tighter than the other upper bound for $\sum_k \|\Delta_k\|^2$ given in (4.14),
so long as $L > 2/\underline{\alpha}$.


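Proof of (5.2). Recall the definition of $\omega$ and note that

$$\lambda_k\omega = \min\{\nu_k - |x_i^* - \lambda_k h_i^*| : i \in D\} > 0. \tag{5.19}$$

Consider $i \in D$ (which implies $i \notin \mathrm{supp}(x^*)$). If $x_i^{k+1} \ne 0$, then Lemma 2.3 implies

$$\begin{aligned}
|x_i^{k+1} - x_i^*|^2 &= \big|S_{\nu_k}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big) - S_{\nu_k}(x_i^* - \lambda_k h_i^*)\big|^2 \\
&\le \Big[\big|(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i) - (x_i^* - \lambda_k h_i^*)\big| - \big(\nu_k - |x_i^* - \lambda_k h_i^*|\big)\Big]^2 \\
&\le \big|(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i) - (x_i^* - \lambda_k h_i^*)\big|^2 - \big(\nu_k - |x_i^* - \lambda_k h_i^*|\big)^2 && (5.20) \\
&\le \big|(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i) - (x_i^* - \lambda_k h_i^*)\big|^2 - \omega^2\lambda_k^2. && (5.21)
\end{aligned}$$

To derive (5.20) we used the fact that

$$\big|(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i) + \lambda_k h_i^*\big| \ge \nu_k - \lambda_k|h_i^*|.$$

To derive (5.21) we used (5.19). Repeating the arguments used to prove (5.13) we
can say the following. If there exists $i \in D$ such that $x_i^{k+1} \ne 0$, then $k \le \bar{K}_D - 2$,
with $\bar{K}_D$ defined in (5.4). Therefore $|x_i^k| = 0$ for all $k > \bar{K}_D - 2$. Since $y_i^{k+1} =
x_i^k + \alpha_k(x_i^k - x_i^{k-1})$, $y_i^k = 0$ for all $i \in D$ and $k > \bar{K}_D - 2 + 2 = \bar{K}_D$, which proves
(5.2). □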

Note that if $x_i^{k+1} \ne 0$ then $\mathrm{sgn}(x_i^{k+1}) = \mathrm{sgn}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big)$, thus (5.1) implies
convergence in sign. We can recover the result by Hale et al. for ISTA (Theorem 4.5
of [17]). To see this, consider (5.17) with $\underline{\alpha} = \overline{\alpha} = 0$ and then use the upper bound on
$\sum_k \|\Delta_k\|^2$ given in (4.14) of Corollary 4.2. Note that we defined the constant $\omega$ in a
slightly different way to Hale et al.
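The identification behaviour described by Theorem 5.1 can be observed on a toy instance. The sketch below is an illustration under our own assumptions, not the paper's experiment: it takes $f(x) = \frac{1}{2}\|Ax - b\|^2$ with $A = I$, so that $x^* = S_\rho(b)$ and $h^* = \nabla f(x^*)$ are known in closed form, and it uses fixed $\lambda$ and $\alpha$. The function `ifbs_l1_ls` is a hypothetical helper name.

```python
import numpy as np


def soft_threshold(u, nu):
    return np.sign(u) * np.maximum(np.abs(u) - nu, 0.0)


def ifbs_l1_ls(A, b, rho, lam, alpha, num_iters):
    """I-FBS for 0.5*||Ax - b||^2 + rho*||x||_1 with fixed lam and alpha.

    Returns the list of iterates x^1, x^2, ... (exact proximal step).
    """
    x_prev = np.zeros(A.shape[1])
    x = x_prev.copy()
    history = []
    for _ in range(num_iters):
        y = x + alpha * (x - x_prev)          # extrapolation step
        grad = A.T @ (A @ y - b)              # gradient of f at y
        x_prev, x = x, soft_threshold(y - lam * grad, rho * lam)
        history.append(x.copy())
    return history


# Toy instance with a known solution (assumption: A = I, so L = 1).
A = np.eye(3)
b = np.array([2.0, 0.3, -1.5])
rho, lam, alpha = 0.5, 0.5, 0.3
x_star = soft_threshold(b, rho)               # closed-form minimizer
h_star = A.T @ (A @ x_star - b)               # gradient of f at x_star
E = np.isclose(np.abs(h_star), rho)           # indices with |h_i^*| = rho
D = ~E                                        # indices with |h_i^*| < rho
history = ifbs_l1_ls(A, b, rho, lam, alpha, 60)
```

On this instance the component in $D$ is exactly zero at every iterate and the components in $E$ carry the sign $-h_i^*/\rho$ throughout, which is the (here immediate) manifold identification.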
5.2. Finite Convergence of FISTA. We can prove convergence to the optimal
manifold in a finite number of iterations under more general conditions than required
in Theorem 5.1, however without explicit bounds on the number of iterations. A
corollary of the following theorem is that the FISTA-like choices proposed by Beck
and Teboulle [16] and Chambolle and Dossal [37] achieve finite manifold identification.

Theorem 5.2. Suppose that Assumption SO holds. Assume $\{\lambda_k\}_{k\in\mathbb{N}}$ is
non-decreasing and satisfies $0 < \lambda_k < 2/L$ for all $k$, and there exists $\overline{\alpha} \ge 0$ such that
$\{\alpha_k\}_{k\in\mathbb{N}}$ satisfies $0 \le \alpha_k \le \overline{\alpha}$ for all $k$. If, for the iterates of I-FBS applied to
Problem SO, it is true that $\sum_k \|\Delta_k\|^2 < \infty$ and $\|x^k - x^*\|$ is bounded for some
$x^* \in X^*$ and for all $k$, then there exists a constant $K > 0$ such that for the iterates
of I-FBS (5.1) and (5.2) hold for all $k > K$.


Proof. Inequality (5.13) and the equivalent recursion for when (5.9) holds can
be proved in exactly the same way. However, we cannot rely on $\overline{\alpha} < 1$, so we have
to modify the proof from that point onwards. Once again, fix $x^* \in X^*$. Rewriting
(5.13), we can say that (5.9) implies that


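$$\begin{aligned}
\|x^{k+1} - x^*\|^2 &\le \|x^k - x^*\|^2 + \overline{\alpha}^2\|\Delta_k\|^2 + 2\alpha_k\langle \Delta_k, x^k - x^*\rangle - \nu_1^2 \\
&\le \|x^k - x^*\|^2 + \overline{\alpha}^2\|\Delta_k\|^2 + 2\overline{\alpha}\|\Delta_k\|\,\|x^k - x^*\| - \nu_1^2 && (5.22) \\
&\le \|x^k - x^*\|^2 + \overline{\alpha}^2\|\Delta_k\|^2 + 2\overline{\alpha}M_1\|\Delta_k\| - \nu_1^2, && (5.23)
\end{aligned}$$

where we used the Cauchy-Schwarz inequality to get (5.22). To derive (5.23) we
used the assumption that there exists $M_1 > 0$ such that $\|x^k - x^*\| \le M_1$. Also by
assumption, there exists $M_2 > 0$ such that

$$\sum_k \|\Delta_k\|^2 \le M_2. \tag{5.24}$$

Now, by Jensen's inequality

$$\sum_{i=0}^{k} \|\Delta_i\| \le \sqrt{k\sum_{i=0}^{k} \|\Delta_i\|^2} \le \sqrt{M_2 k}. \tag{5.25}$$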


Substituting (5.24) and (5.25) into (5.23) yields the following: If (5.9) is true then

$$\|x^{k+1} - x^*\|^2 \le \|x^0 - x^*\|^2 + \overline{\alpha}^2 M_2 + 2\overline{\alpha}M_1\sqrt{M_2 k} - k\nu_1^2. \tag{5.26}$$

The r.h.s. of (5.26) can be non-negative for only a finite number of iterations, which
proves (5.1).

For (5.2), the recursion is

$$\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2 + \overline{\alpha}^2\|\Delta_k\|^2 + 2\alpha_k\langle \Delta_k, x^k - x^*\rangle - \omega^2\lambda_1^2$$

and the reasoning is the same from this point onwards. □

The classical FISTA parameter choice in (3.1) due to Beck and Teboulle [16],
along with others which guarantee the $O(1/k^2)$ rate provided by Tseng [53], satisfy
the assumptions of Theorem 5.2 so long as $F$ is coercive (or equivalently has bounded
level-sets). The first condition, $\sum_k \|\Delta_k\|^2 < \infty$, can be shown by considering the
following facts. The sequence $b_k$ defined on page 196 of [16] is bounded, which implies
the sequence $u_k$ defined on page 194 of [16] is also bounded. If $F$ is coercive, then $x^k$
is bounded, since $F(x^k) \to F^*$. This implies $\|x^k - x^*\|$ is bounded for some $x^* \in X^*$.
It also implies $t_k\Delta_k$ is bounded and since $t_k = O(k)$, $\|\Delta_k\| = O(1/k)$, and the result
follows.

The parameter choice (3.2) due to Chambolle and Dossal [37] satisfies the assumptions
of this theorem, even when $F$ is not assumed to be coercive. $\sum_k \|\Delta_k\|^2$ is finite
by Corollary 2 of [37] and $\|x^k - x^*\|$ is shown to be bounded for all $x^* \in X^*$ in the
proof of Theorem 3 of [37]. (In fact Chambolle and Dossal proved that $\sum_k k\|\Delta_k\|^2$ is
finite for their parameter choice.)


5.3. Finite Reduction to Local Minimization. Theorems 5.1 and 5.2 allow
us to characterize the behavior of I-FBS (including the FISTA-like choices) after a
finite manifold identification period. In the following corollary, we show that after
a finite number of iterations, I-FBS reduces to minimizing a smooth function over
$E$ subject to an orthant constraint. The following corollary generalizes the result of
Corollary 4.6 in [17] from ISTA to I-FBS.

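Corollary 5.3. Suppose that Assumption SO holds. Assume $\{\lambda_k\}_{k\in\mathbb{N}}$ is
non-decreasing and satisfies $0 < \lambda_k \le 1/L$, and there exist $\underline{\alpha}, \overline{\alpha} \in [0,1)$ such that $\underline{\alpha} \le \overline{\alpha}$
and $\{\alpha_k\}_{k\in\mathbb{N}}$ satisfies $\underline{\alpha} \le \alpha_k \le \overline{\alpha}$ for all $k$. Then, after finitely many iterations, the
iterates of I-FBS applied to Problem SO become equivalent to the iterates of I-FBS
applied to minimizing $\phi : \mathbb{R}^{|E|} \to \mathbb{R}$, where

$$\phi(x_E) = -(h_E^*)^T x_E + f((x_E, 0)), \tag{5.27}$$

constrained to the orthant $O_E$, where

$$O_E = \{x_E \in \mathbb{R}^{|E|} : -\mathrm{sgn}(h_i^*)\,x_i \ge 0,\ \forall i \in E\}. \tag{5.28}$$

Specifically, there exists $K > 0$ such that for all $k > K$,

$$y_E^{k+1} = x_E^k + \alpha_k(x_E^k - x_E^{k-1}) \tag{5.29}$$

$$x_E^{k+1} = P_{O_E}\big(y_E^{k+1} - \lambda_k\nabla\phi(y_E^{k+1})\big), \tag{5.30}$$

$x_D^k = y_D^k = 0$, and $F(x^k) = \phi(x_E^k)$. Furthermore $K \le \max\{\bar{K}_D, \bar{K}_E\}$, with $\bar{K}_E$ and
$\bar{K}_D$ defined in (5.3) and (5.4).

Alternatively, if the conditions of Theorem 5.2 hold, then there exists $K' > 0$ such
that (5.29)-(5.30) hold, $x_D^k = y_D^k = 0$ and $F(x^k) = \phi(x_E^k)$, for all $k > K'$.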

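Proof. From Theorem 5.1 there exists a $K$ such that for all $k > K$, (5.1) and
(5.2) hold and $K \le \max\{\bar{K}_D, \bar{K}_E\}$. Take $k > K$. Since $x_i^k = 0$ for all $i \in D$ it suffices
to consider $i \in E$. For $i \in E$, $k > K$, using (5.1) we have

$$x_i^{k+1} \ge 0 \ \text{ if } \ \mathrm{sgn}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big) = 1, \ \text{ equivalently } h_i^* < 0,$$

and

$$x_i^{k+1} \le 0 \ \text{ if } \ \mathrm{sgn}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big) = -1, \ \text{ equivalently } h_i^* > 0.$$

Therefore for any $i \in E$, $-h_i^* x_i^k \ge 0$, thus $x_E^k \in O_E$ for all $k > K$. Next note
that, for all $i \in E$, $-h_i^* x_i^k = \rho|x_i^k|$. Therefore for $k > K$, $-(h_E^*)^T x_E^k = \rho\|x_E^k\|_1$, thus
$F(x^k) = \phi(x_E^k)$.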


^We thank Antonin Chambolle and Charles Dossal for pointing this out to us. 
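Now for $i \in E$, $k > K$, we calculate the quantity

$$\begin{aligned}
z_i^{k+1} := y_i^{k+1} - \lambda_k\nabla\phi(y_E^{k+1})_i &= y_i^{k+1} - \lambda_k\big(-h_i^* + \nabla f(y^{k+1})_i\big) \\
&= y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i + \rho\lambda_k\Big(\frac{h_i^*}{\rho}\Big) \\
&= \mathrm{sgn}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big)\big(|y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i| - \rho\lambda_k\big).
\end{aligned}$$

Therefore, for $i \in E$, $k > K$,

$$\begin{aligned}
y_i^{k+1} &= x_i^k + \alpha_k(x_i^k - x_i^{k-1}), \\
x_i^{k+1} &= S_{\rho\lambda_k}\big(y_i^{k+1} - \lambda_k\nabla f(y^{k+1})_i\big) = \begin{cases} z_i^{k+1} & : -h_i^* z_i^{k+1} \ge 0 \\ 0 & : \text{else.} \end{cases}
\end{aligned}$$

Equivalently, for $k > K$,

$$\begin{aligned}
y_E^{k+1} &= x_E^k + \alpha_k(x_E^k - x_E^{k-1}), \\
x_E^{k+1} &= P_{O_E}\big(y_E^{k+1} - \lambda_k\nabla\phi(y_E^{k+1})\big).
\end{aligned}$$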

Due to Theorem 5.2, the same arguments hold for parameter choices such that
$\sum_k \|\Delta_k\|^2$ is finite, $\|x^k - x^*\|$ is bounded for all $k$ and some $x^* \in X^*$, and $\alpha_k$ is
bounded. However there is no explicit upper bound on $K$.

□
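The key step of the proof, namely that soft-thresholding coincides with a projected gradient step on $\phi$ once the sign condition (5.1) holds, can be checked numerically. The sketch below is illustrative only; the helper names and the numerical values are our own choices, and the check is restricted to the components in $E$.

```python
import numpy as np


def soft_threshold(u, nu):
    return np.sign(u) * np.maximum(np.abs(u) - nu, 0.0)


def project_orthant(z, h_star_E):
    """Projection onto O_E = {z : -sgn(h_i^*) z_i >= 0 for all i in E}."""
    return np.where(-np.sign(h_star_E) * z >= 0.0, z, 0.0)


def prox_step_on_E(y_E, grad_f_E, rho, lam):
    """Forward-backward step of the original problem, restricted to E."""
    return soft_threshold(y_E - lam * grad_f_E, rho * lam)


def projected_step_on_E(y_E, grad_f_E, h_star_E, lam):
    """Projected gradient step on phi; grad phi = grad_f_E - h_star_E."""
    return project_orthant(y_E - lam * (grad_f_E - h_star_E), h_star_E)


# Hypothetical values chosen so that sgn(y - lam*grad) = -h^*/rho on E.
rho, lam = 0.5, 1.0
h_star_E = np.array([-0.5, 0.5])
y_E = np.array([1.0, -0.1])
grad_f_E = np.array([-0.2, 0.2])
a = prox_step_on_E(y_E, grad_f_E, rho, lam)
c = projected_step_on_E(y_E, grad_f_E, h_star_E, lam)
```

With these values both steps return approximately $(0.7, 0)$: the first component is shrunk by $\rho\lambda$, and the second is set to zero by the threshold in one formulation and by the projection in the other.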


In principle one could switch to minimizing $\phi$ directly once the algorithm has
reduced to (5.29)-(5.30). This would allow for a larger step-size, since the Lipschitz
constant of $\nabla\phi$ is less than $L$. However it is not possible to know with certainty that
the algorithm has transitioned to the form (5.29)-(5.30) unless the number of iterations
exceeds the upper bound $\max\{\bar{K}_D, \bar{K}_E\}$, although we discuss some heuristics for
identifying this transition in Section 5.5. The main drawback of this strategy is that
once it switches to minimizing $\phi$ directly the support of $x^k$ is fixed. Therefore any
mismatch between $\mathrm{supp}(x^k)$ and $\mathrm{supp}(x^*)$ is not identified and the algorithm will not
necessarily converge to an optimal point. In the next section we discuss a method
which uses the optimal momentum for minimizing $\phi$ yet continues to use a smaller
step-size and is therefore guaranteed to converge to a minimizer.


5.4. A Simple Locally Optimal Parameter Choice for I-FBS. The analysis
of the previous three sections shows that, after a finite number of iterations, I-FBS
(subject to parameter conditions) reduces to minimizing the function $\phi$ subject to an
orthant constraint. Even though $f$ is not assumed to be strongly convex, $\phi$ might be.
If this function is strongly convex, the asymptotic rate of convergence can be
determined by the worst-case condition number of the Hessian. Throughout this section
let $H_{EE}(v)$ be the Hessian of $\phi$ evaluated at $v$. In terms of strategies for choosing
$\{\alpha_k\}_{k\in\mathbb{N}}$ and $\{\lambda_k\}_{k\in\mathbb{N}}$, one approach is to choose them to obtain the best iteration
complexity for minimizing $\phi$. In the following Corollary, we provide a simple fixed
choice which does this and thus optimizes the asymptotic iteration complexity.

Corollary 5.4. Suppose that Assumption SO holds, and $\phi$ is strongly convex.
Let $x^*$ be the unique minimizer of Problem SO and $l_E$ be the strong convexity
parameter of $\phi$. If $\lambda \in (0, 1/L]$,

$$\lambda_k = \lambda \quad \text{and} \quad \alpha_k = \frac{1 - \sqrt{l_E\lambda}}{1 + \sqrt{l_E\lambda}}, \quad \forall k \in \mathbb{Z}_+, \tag{5.31}$$

then the iterates of I-FBS converge to $x^*$ linearly and $F(x^k)$ converges to $F^*$
linearly. Indeed

$$F(x^k) - F^* = O\Big(\big(1 - \sqrt{l_E\lambda}\big)^k\Big). \tag{5.32}$$
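As a small worked illustration of the parameter choice (5.31) and the contraction factor in (5.32), the following sketch evaluates both for given $l_E$ and $\lambda$. The function names are ours, and the check $0 < l_E\lambda \le 1$ is an assumption we add so that the formulas are well defined.

```python
import math


def locally_optimal_momentum(l_E, lam):
    """Fixed momentum (1 - sqrt(l_E*lam)) / (1 + sqrt(l_E*lam)) from (5.31)."""
    q = math.sqrt(l_E * lam)
    if not 0.0 < q <= 1.0:
        raise ValueError("expected 0 < l_E * lam <= 1")
    return (1.0 - q) / (1.0 + q)


def linear_rate_factor(l_E, lam):
    """Per-iteration factor 1 - sqrt(l_E*lam) appearing in (5.32)."""
    return 1.0 - math.sqrt(l_E * lam)
```

For instance, with $l_E = 0.04$ and $\lambda = 1$ the momentum is $0.8/1.2 = 2/3$ and the factor is $0.8$.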


Proof. The analysis of the previous sections shows that, for the given choice of
$\{\alpha_k\}_{k\in\mathbb{N}}$ and $\{\lambda_k\}_{k\in\mathbb{N}}$ there exists a $K$ such that, for all $k > K$, (5.29) and (5.30)
hold, and $F(x^k) = \phi(x_E^k)$. Thus for $k > K$ the algorithm is equivalent to Nesterov's
method (Section 2.2 of [15]) applied to the strongly convex function $\phi$ subject to the
orthant constraint $O_E$. Therefore we apply Theorem 2.2.1 of [15] with the fixed
parameter choice (constant scheme 3, discussed on p. 76 of [25]). The only difference
compared to Theorem 2.2.1 is that we allow for step-sizes other than $1/L_E$, where $L_E$
is the Lipschitz constant of $\nabla\phi$. Note that $L \ge L_E$. This minor change is discussed
on p. 72 of [15]. Setting $\lambda = 1/L$ (the maximum allowed step-size) gives:

$$F(x^k) - F^* = O\left(\Big(1 - \sqrt{\frac{l_E}{L}}\Big)^k\right). \tag{5.33}$$

Another minor issue to note is that the minimization is constrained to the simple
convex set $O_E$. This does not affect the convergence of Nesterov's method, as discussed
in Constant Step Scheme (2.2.17) of [15].

By the strong convexity of $\phi$, the sequence $\{x^k\}$ also achieves linear convergence
with the same iteration complexity. □

The iteration complexity with this parameter choice is

$$\Omega\left(\sqrt{\frac{L}{l_E}}\log\frac{1}{\epsilon}\right). \tag{5.34}$$

This is the best asymptotic iteration complexity that can be achieved by I-FBS using
this step-size [15]. Therefore we will refer to it as the locally optimal choice. Indeed
it is better than the iteration complexity of ISTA [17] (which corresponds to I-FBS
with $\alpha_k$ equal to 0) which is

$$\Omega\left(\frac{L}{l_E}\log\frac{1}{\epsilon}\right).$$
We will see in the next section that (5.34) is better than the iteration complexity
achieved by the FISTA-like choices of [16], [53] and [57].

In practice the optimal momentum is not known a priori as it depends on the
smallest eigenvalue of $H_{EE}$. The momentum could be estimated periodically based
on the smallest eigenvalue of the Hessian corresponding to the current support set, or
adapted based on the behavior of $F(x^k)$ (see Section 5.6 and Section 6).

The authors of [34] proposed an adaptive momentum restart scheme for Nesterov's
method in the case of smooth optimization (i.e. $g(x) = 0$). Corollary 5.3 implies that
the scheme can also be used for I-FBS applied to Problem SO. This follows because
I-FBS with parameter choice (3.1) reduces to minimizing a smooth function $\phi$ after
a finite number of iterations, after which the momentum restart scheme can be used.
Referring to the analysis of [31], it can be shown that the method will have the same
iteration complexity as given in (5.34).
Local linear convergence can also be proved when the local function $\phi$ is not
strongly convex, but the limit point of the iterations obeys the "strict-complementarity"
condition: $E = \mathrm{supp}(x^*)$. Furthermore the Hessian matrix of $\phi$ must be invariant in
a region containing the limit. In the following corollary, let $x^* = \lim_{k\to\infty} x^k$, which
is in $X^*$ by Theorem 4.1.

Corollary 5.5. Suppose Assumption SO holds and $E = \mathrm{supp}(x^*)$, where
$\lim_{k\to\infty} x^k = x^* \in X^*$. Let $H_{EE}(x)$ be the Hessian of the function $\phi$ defined in
(5.27). Let $\tilde{l}_E$ be the smallest non-zero eigenvalue of $H_{EE}(x_E^*)$. Assume the range
space of $H_{EE}$ is invariant in some neighborhood $N^*$ around $x^*$. If all eigenvalues
of $H_{EE}(x_E^*)$ are zero, $x^k = x^*$ after a finite number of iterations, for any choice
of $\{\lambda_k, \alpha_k\}$ satisfying the conditions of Theorem 5.1 or Theorem 5.2. If $\tilde{l}_E > 0$,
$\lambda \in (0, 1/L]$,

$$\lambda_k = \lambda \quad \text{and} \quad \alpha_k = \frac{1 - \sqrt{\tilde{l}_E\lambda}}{1 + \sqrt{\tilde{l}_E\lambda}}, \quad \forall k \in \mathbb{Z}_+, \tag{5.35}$$

then the iterates $x^k$ of I-FBS converge to $x^*$ linearly and $F(x^k)$ converges to $F^*$
linearly. Indeed

$$F(x^k) - F^* = O\Big(\big(1 - \sqrt{\tilde{l}_E\lambda}\big)^k\Big).$$
Proof. The proof proceeds almost identically to Theorem 4.11 of [17]. Note that if
$x^k \to x^*$ then $y^k \to x^*$. Now Lemma 5.3 of [17] can be directly applied to I-FBS to
say that, after a finite number of iterations,

$$x_i^{k+1} = y_i^{k+1} - \lambda\big(\nabla f(y^{k+1})_i - h_i^*\big), \quad \forall i \in \mathrm{supp}(x^*). \tag{5.36}$$

Assume $k$ is large enough that (5.36) holds, $x^k \in N^*$, and $k > \max\{\bar{K}_D, \bar{K}_E\}$. Since
$x_i^k$ is 0 for all $i \in D$, it suffices to consider $i \in E = \mathrm{supp}(x^*)$. Recall that $H(x)$ is the
Hessian of $f$ evaluated at $x$. Now, let $\bar{H}^k$ be defined as

$$\bar{H}^k = \int_0^1 H\big(x^* + t(y^{k+1} - x^*)\big)\,dt.$$

By assumption the range spaces of $\bar{H}^k_{EE}$ are now invariant over $k$. For a matrix $W$, let
$W_{EE}$ be the $|E| \times |E|$ submatrix of $W$ with row and column indices given by $E$. Let
$P$ be the orthogonal projection onto the range space of $H_{EE}$. Since $E = \mathrm{supp}(x^*)$,
equation (5.36) can be used to claim that

$$x_E^{k+1} = y_E^{k+1} - \lambda\big(\nabla f(y^{k+1}) - h^*\big)_E = y_E^{k+1} - \lambda\bar{H}^k_{EE}\big(y_E^{k+1} - x_E^*\big). \tag{5.37}$$

This follows from the mean value theorem. At each iteration the term $-\lambda\bar{H}^k_{EE}(y_E^{k+1} -
x_E^*)$ stays in the range space of $H_{EE}$, which implies that the null-space components
of the iterates have already converged. In other words, for $k$ sufficiently large,

$$(I - P)(x_E^k - x_E^*) = 0.$$

If $\tilde{l}_E > 0$, it suffices to consider the convergence of $\{Px_E^k\}$, that is, consider the
component in the range space of $H_{EE}$. Since $\phi$ restricted to the range space of $H_{EE}$
is strongly convex, we now simply repeat the arguments of Corollary 5.3 and the result
follows. □
5.5. "Underdamped" I-FBS. In [34], the behavior of the FISTA-like methods
when applied to strongly convex functions was investigated. It was shown that
for such functions if $\alpha_k \approx 1$, the algorithm moves into what is known as an
"underdamped regime", which leads to oscillations in the objective function at a predictable
frequency, and a sub-optimal iteration complexity. The results of the preceding
sections allow us to extend this analysis to Problem SO, despite it being nonsmooth and
not strictly convex.

The analysis of [31] revealed that if $\alpha_k \approx 1$, the trace of the objective function
values will oscillate with a frequency proportional to $1/\sqrt{\kappa}$, where $\kappa$ is the condition
number of the Hessian at the minimum. The iteration complexity in the high-momentum
regime with step-size $\lambda = 1/L$ is

$$\Omega\left(\kappa\log\frac{1}{\epsilon}\right).$$

More precisely, the behavior $F(x^k) \approx C(1 - 1/\kappa)^k\cos^2(k/\sqrt{\kappa})$ is observed. Now
Corollary 5.3 shows that after a finite number of iterations, FISTA (with parametric
constraints) reduces to minimizing $\phi$ (defined in (5.27)) subject to an orthant constraint.
Therefore we can apply the analysis of [31] once the algorithm is in this regime. Thus
if $l_E > 0$, FISTA obeys the conditions of Theorem 5.1 or 5.2, and $\alpha_k \to 1$, then the
iteration complexity will be [34]

$$\Omega\left(\frac{L}{l_E}\log\frac{1}{\epsilon}\right),$$

which is worse than the iteration complexity achieved with the locally optimal choice
given in Corollary 5.4. The trace of the objective function will exhibit oscillations
with period proportional to $\sqrt{L/l_E}$. This result directly applies to the FISTA-like choices of [16],
[53] and [57]. The result also applies when $l_E = 0$ under the strict-complementarity
condition. In this case replace $l_E$ with $\tilde{l}_E$ defined in Corollary 5.5.


5.6. An Adaptive Modification. In numerical experiments we have noticed
that it can take many iterations for the optimal manifold to be identified by I-FBS.
This means that one of the FISTA-like choices can outperform I-FBS with our locally
optimal choice (5.31) before the optimal manifold is identified. This is because the
FISTA-like choices guarantee $O(1/k^2)$ convergence during this phase whereas the
locally optimal choice does not have a guaranteed rate until the optimal manifold is
identified (although an $O(1/k)$ rate can probably be established following the analysis
of [57], this is beyond the scope of this paper). On the other hand the analysis
of the previous section showed that the FISTA-like choices have poor performance
once the algorithm is in the optimal manifold. In summary the FISTA-like choices
have excellent global properties but poor local properties.

In light of this we propose the following adaptive heuristic. We use the condition
$F(x^k) > F(x^{k-1})$ as an indication the algorithm is at least approximately operating
in the optimal manifold. This is because with a FISTA-like choice the algorithm will
eventually converge to the optimal manifold and then the function values will start
to oscillate. So the adaptive modification is the following. Use Beck and Teboulle's
parameter choice of (3.1) until $F(x^k) > F(x^{k-1})$. For all iterations after, use the
locally optimal momentum given in (5.31). We call this scheme FISTA-AdOPT. See

Experiment 1 for empirical results. It is worth mentioning that it is better to use the 
condition {x^^^ — x^) > 0 rather than F{x^) > F{x^~^). It was shown 
in [34] that the two conditions are equivalent; however, the first avoids computation of
$F$.
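The switching rule of FISTA-AdOPT can be sketched as follows for $f(x) = \frac{1}{2}\|Ax - b\|^2$. This is a minimal illustration under stated assumptions, not the authors' implementation: the Beck and Teboulle rule is taken to be the standard $t_k$ recursion, the function-value test $F(x^k) > F(x^{k-1})$ is used as the trigger, $l_E$ is estimated once from the support at the switch, and the fallback to zero momentum when that estimate is not positive is our own choice. The name `fista_adopt` is hypothetical.

```python
import math
import numpy as np


def soft_threshold(u, nu):
    return np.sign(u) * np.maximum(np.abs(u) - nu, 0.0)


def fista_adopt(A, b, rho, num_iters):
    """Sketch of FISTA-AdOPT for 0.5*||Ax - b||^2 + rho*||x||_1.

    Returns (x, switch_iter); switch_iter is None if no switch occurred.
    """
    gram = A.T @ A
    Atb = A.T @ b
    L = float(np.linalg.eigvalsh(gram)[-1])   # Lipschitz constant of grad f
    lam = 1.0 / L

    def objective(x):
        r = A @ x - b
        return 0.5 * float(r @ r) + rho * float(np.abs(x).sum())

    x_prev = np.zeros(A.shape[1])
    x = x_prev.copy()
    F_prev = objective(x)
    t = 1.0
    alpha_fixed = None
    switch_iter = None
    for k in range(1, num_iters + 1):
        if alpha_fixed is None:
            # Beck-Teboulle momentum (assumed form of choice (3.1)).
            t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
            alpha = (t - 1.0) / t_next
            t = t_next
        else:
            alpha = alpha_fixed
        y = x + alpha * (x - x_prev)
        x_prev, x = x, soft_threshold(y - lam * (gram @ y - Atb), rho * lam)
        F = objective(x)
        if alpha_fixed is None and F > F_prev:
            # Estimate l_E from the current support, once, at the switch.
            support = np.flatnonzero(x)
            l_E = 0.0
            if support.size > 0:
                sub = gram[np.ix_(support, support)]
                l_E = float(np.linalg.eigvalsh(sub)[0])
            if l_E > 0.0:
                q = math.sqrt(l_E * lam)
                alpha_fixed = (1.0 - q) / (1.0 + q)
            else:
                alpha_fixed = 0.0             # fallback (our assumption)
            switch_iter = k
        F_prev = F
    return x, switch_iter
```

On the separable instance `A = np.diag([1.0, 0.2])`, `b = np.array([2.0, 1.0])`, `rho = 0.1`, the minimizer is $(1.9, 2.5)$; the FISTA phase overshoots along the poorly conditioned coordinate, the objective increases, and the method switches to the fixed momentum computed from the support.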


The iterates of FISTA-AdOPT are guaranteed to converge to a solution and,
until $F(x^k) > F(x^{k-1})$ occurs, the convergence in the objective function is $O(1/k^2)$.
Furthermore, the asymptotic convergence rate is optimal and given by (5.33). The
main drawback of the scheme is that it might switch to the locally optimal parameter
choice before the iterates are confined to the optimal manifold. Another drawback
is that the locally optimal momentum parameter must be estimated, which involves
computing the largest and smallest singular values of $A$ confined to the current support
set. This computation is nontrivial when the support set is large. A better alternative
is to use the adaptive restart scheme proposed in [34]. Thanks to our analysis, this
method is guaranteed to achieve the same asymptotic iteration complexity as the
locally optimal parameter choice, but will also achieve the $O(1/k^2)$ performance in
the transient regime prior to manifold identification (see the remarks after Corollary
5.4). The practical performance of the momentum restart scheme and FISTA-AdOPT
are compared in the next section.


5.7. The Splitting Inertial Proximal Method. In [35], Moudafi and Oliny 
introduced the Splitting Inertial Proximal Method (SIPM): 


^k+l 


e + a;'" - XkVf{x'^) + akix'^ - x'^ ^). 


(5.38) 


In fact they introduced it for the more general monotone inclusion problem. The
method is a direct generalization of Polyak's heavy-ball with friction method to
problems involving the sum of two functions. It differs from I-FBS in that the gradient
of $f$ is computed at $x^k$ rather than the extrapolated point $x^k + \alpha_k(x^k - x^{k-1})$.
Our analysis of I-FBS in the case of sparse optimization can be extended easily to
Moudafi and Oliny's method under the condition that $\sum_k \|\Delta_k\|^2$ is finite, for which
sufficient conditions were established in [35].
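The difference between the two updates can be made concrete with a short sketch for $g = \rho\|\cdot\|_1$ and an exact proximal step ($\epsilon_k = 0$). The function names and the callable `grad_f` are illustrative assumptions; only the placement of the gradient evaluation is taken from the description above.

```python
import numpy as np


def soft_threshold(u, nu):
    return np.sign(u) * np.maximum(np.abs(u) - nu, 0.0)


def ifbs_step(x, x_prev, grad_f, rho, lam, alpha):
    """I-FBS: extrapolate first, then evaluate the gradient at y."""
    y = x + alpha * (x - x_prev)
    return soft_threshold(y - lam * grad_f(y), rho * lam)


def sipm_step(x, x_prev, grad_f, rho, lam, alpha):
    """SIPM: gradient at x^k, inertial term added to the forward step."""
    return soft_threshold(x - lam * grad_f(x) + alpha * (x - x_prev), rho * lam)


# Hypothetical smooth part f(v) = 0.5*||v - b||^2, so grad f(v) = v - b.
b = np.array([2.0, -1.0])
grad_f = lambda v: v - b
x = np.array([1.0, -0.5])
x_prev = np.array([0.5, 0.0])
ifbs_next = ifbs_step(x, x_prev, grad_f, 0.1, 0.5, 0.5)
sipm_next = sipm_step(x, x_prev, grad_f, 0.1, 0.5, 0.5)
```

With $\alpha_k = 0$ the two steps coincide (both reduce to ISTA); for $\alpha_k > 0$ they generally differ because the forward step sees a different gradient.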

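Theorem 5.6. Suppose that Assumption SO holds. Assume $\epsilon_k$ is 0 for all $k \in \mathbb{N}$,
$0 < \lambda_k < 2/L$ for all $k$, and there exists $\overline{\alpha} > 0$ such that $0 \le \alpha_k \le \overline{\alpha}$ for all $k$. If
$\sum_k \|\Delta_k\|^2 < \infty$, and $\|x^k - x^*\|$ is bounded for all $x^* \in X^*$, then there exists a constant
$K > 0$ such that, for all $k > K$ the iterates of Algorithm (5.38) applied to Problem
SO satisfy

$$\mathrm{sgn}\big(x_i^k - \lambda_k\nabla f(x^k)_i + \alpha_k(x_i^k - x_i^{k-1})\big) = -\frac{h_i^*}{\rho}, \quad \forall i \in E,$$

and

$$x_i^k = 0, \quad \forall i \in D.$$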


Proof. The proof follows in a similar way to Theorems 5.1 and 5.2. Equation
(5.22) is proved by following similar arguments as in the proof of Theorem 5.1. We
include the salient differences.

Recall that $\nu_k = \rho\lambda_k$. If

$$\mathrm{sgn}\big(x_i^k - \lambda_k\nabla f(x^k)_i + \alpha_k(x_i^k - x_i^{k-1})\big) \ne \mathrm{sgn}(x_i^* - \lambda_k h_i^*) = -h_i^*/\rho \quad \text{for some } i \in E, \tag{5.39}$$
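then,

$$\begin{aligned}
|x_i^{k+1} - x_i^*|^2 &= \big|S_{\nu_k}\big(x_i^k - \lambda_k\nabla f(x^k)_i + \alpha_k(x_i^k - x_i^{k-1})\big) - S_{\nu_k}(x_i^* - \lambda_k h_i^*)\big|^2 \\
&\le \big|x_i^k - \lambda_k\nabla f(x^k)_i + \alpha_k(x_i^k - x_i^{k-1}) - (x_i^* - \lambda_k h_i^*)\big|^2 - \nu_k^2,
\end{aligned} \tag{5.40}$$

where (5.40) follows for the same reasons as given for (5.10). Continuing to follow the
proof of Theorem 5.1 we say the following: if (5.39) holds, then

$$\begin{aligned}
\|x^{k+1} - x^*\|^2 &= \sum_{j \ne i} |x_j^{k+1} - x_j^*|^2 + |x_i^{k+1} - x_i^*|^2 \\
&\le \sum_{j \ne i} \big|x_j^k - \lambda_k\nabla f(x^k)_j + \alpha_k(x_j^k - x_j^{k-1}) - (x_j^* - \lambda_k h_j^*)\big|^2 \\
&\quad + \big|x_i^k - \lambda_k\nabla f(x^k)_i + \alpha_k(x_i^k - x_i^{k-1}) - (x_i^* - \lambda_k h_i^*)\big|^2 - \nu_k^2 && (5.41) \\
&= \|x^k - \lambda_k\nabla f(x^k) + \alpha_k\Delta_k - (x^* - \lambda_k h^*)\|^2 - \nu_k^2 \\
&\le \|x^k - x^*\|^2 + \overline{\alpha}^2\|\Delta_k\|^2 + 2\overline{\alpha}\|x^k - x^*\|\,\|\Delta_k\| - \nu_1^2. && (5.42)
\end{aligned}$$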


Equation (5.41) follows from the element-wise nonexpansiveness of $S_\nu$, and (5.40),
and (5.42) follows from the nonexpansiveness of $I - \lambda\nabla f$, by application of
Cauchy-Schwarz, and by substituting the upper bound $\overline{\alpha}$ for $\alpha_k$. Equation (5.42) is identical
to (5.22). From this point on, the proof is identical to Theorem 5.2. As before we
cannot explicitly bound the number of iterations unless we know an upper bound for
$\sum_k \|\Delta_k\|^2$. □

Moudafi and Oliny provided a condition on the parameters under which
$\sum_k \|\Delta_k\|^2 < \infty$ and $\|x^k - x^*\|$ is bounded for all $x^* \in X^*$. Choose the sequence $\{\alpha_k\}_{k\in\mathbb{N}}$ to be
non-decreasing, the constant $\overline{\alpha} < 1/3$, and the sequence $\{\lambda_k\}_{k\in\mathbb{N}}$ to be bounded away
from $2/L$. They showed that this implies $\|x^k - x^*\|$ is bounded for all $k$ and $x^* \in X^*$
and

$$\sum_k \alpha_k\|\Delta_k\|^2 < \infty.$$

Now using the fact that $\alpha_k$ is less than 1/3, $\sum_k \|\Delta_k\|^2 < \infty$. If $\alpha_k$ is zero for all $k$
the theorem follows from [17] since the algorithm reduces to ISTA.

We have proved that finite convergence in sign on $E$ and to 0 on $D$ holds for
SIPM applied to Problem SO. An analogous result to Corollary 5.3 can now be shown.
After a finite number of iterations, SIPM reduces to HBF projected onto a quadrant.
Because of its similarity to Corollary 5.3, the proof is omitted.

Corollary 5.7. Assume $0 < \lambda_k < 2/L$ for all $k$, and there exists $\overline{\alpha} > 0$ such
that $0 \le \alpha_k \le \overline{\alpha}$ for all $k$. If $\sum_k \|\Delta_k\|^2 < \infty$ and $\|x^k - x^*\|$ is bounded for all
$x^* \in X^*$, then there exists a $K > 0$ such that for $k > K$, the iterates of SIPM satisfy

$$\begin{aligned}
x_E^{k+1} &= P_{O_E}\big(x_E^k - \lambda_k\nabla\phi(x_E^k) + \alpha_k(x_E^k - x_E^{k-1})\big), \\
x_D^k &= 0, \\
F(x^k) &= \phi(x_E^k),
\end{aligned}$$

where $\phi$ and $O_E$ are defined in (5.27) and (5.28) respectively.


Since SIPM reduces to projected HBF, parameter choices could be made to
optimize the performance of HBF on the optimal manifold. Such computations have been
carried out elsewhere [15] for the special case of Problem $\ell_1$-LS. The analysis closely
follows the original work by Polyak for HBF [35]. For strongly convex quadratic
functions, HBF obtains linear convergence with an iteration complexity that is lower
than that of gradient descent by a factor of $\sqrt{\kappa}$, where $\kappa$ is the condition number of
the Hessian. For Problem $\ell_1$-LS, when
$l_E > 0$ the optimal asymptotic iteration complexity of HBF turns out to be equal to
that of I-FBS with our optimal choice given in (5.34). However we can only guarantee
local linear convergence for non-decreasing choices of $\alpha_k$ in the range $[0, \overline{\alpha}]$ with $\overline{\alpha}$ less
than 1/3. So the locally optimal value of the momentum must be less than 1/3 for
this to hold. Otherwise convergence is linear with a worse iteration complexity [45].

6. Numerical Simulations. We now compare several choices of parameters for
I-FBS applied to a random instance of Problem $\ell_1$-LS. To compute $E$ and thus find
the locally optimal parameter choice given in (5.31), we use the interior point solver of
[46] to find a solution to a target duality gap of 10“®. We then compute $h^* = \nabla f(x^*)$,
and approximate the set $E$ by the set of all entries such that $\rho - |h_i^*|$ is smaller than
10“^. We also use the interior point solver to find an estimate of $F^*$. Recall that $l_E$
denotes the smallest eigenvalue of $A_E^T A_E$ and note that $l_E$ was greater than 0 in all
experiments we ran. The parameter choices under consideration will be referred to
by the following designations.

• I-FBS-OPT. I-FBS with our locally optimal parameters derived in Section
5.4 and given in (5.31), with $\lambda = 1/L$. Recall that this choice optimizes the local
convergence rate. Note that this is not a practically implementable algorithm,
as it depends on the optimal momentum. We include it to verify the theory
of Section 5.

• ISTA-OPT [17]. ISTA with parameters chosen to optimize the local convergence
rate. Corresponds to I-FBS with $\alpha_k = 0$ and step-size $\lambda_k = 2/(L + l_E)$.
As with I-FBS-OPT, this step-size cannot be used in practice as it depends
on $l_E$. Using $\lambda_k = 1/L$ is common in practice but gives a worse convergence
rate.

• FISTA-BT [16]. FISTA with Beck and Teboulle's parameters given in (3.1).
• FISTA-AdOPT. Our proposed method, see Section 5.6. FISTA with Beck
and Teboulle's choice until $F(x^k) > F(x^{k-1})$ (equivalently,
$(y^k - x^k)^\top (x^k - x^{k-1}) > 0$), and I-FBS-OPT for all iterations thereafter. We estimate
the locally optimal momentum by using $\mathrm{supp}(x^k)$ as a surrogate for $E$, since
$\mathrm{supp}(x^k) \subset E$ for $k$ large enough. We compute the smallest eigenvalue of
$A_{\mathrm{supp}(x^k)}^\top A_{\mathrm{supp}(x^k)}$ one time, when $F(x^k) > F(x^{k-1})$ first occurs, and compute
the locally optimal momentum using (5.31). We use this fixed value of the momentum
from then on. This algorithm is practically implementable so long as
$|\mathrm{supp}(x^k)|$ is not too large.
• FISTA-AdRe [34]. Adaptive momentum restart for FISTA. Beck and
Teboulle's parameter choice given in (3.1), except set $t_k$ to 0 whenever
$F(x^k) > F(x^{k-1})$ (equivalently, $(y^k - x^k)^\top (x^k - x^{k-1}) > 0$). Our analysis shows
this method achieves the optimal asymptotic iteration complexity derived in
Corollary 5.3 (see the remarks after Corollary 5.4).

All algorithms are initialized with both starting iterates equal to 0.
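The designations above differ only in how the momentum and step-size are chosen, which the following minimal sketch tries to make concrete. It assumes the objective $F(x) = \frac{1}{2}\|Ax - b\|^2 + \rho\|x\|_1$ and the standard Beck and Teboulle recursion $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$ with momentum $(t_k - 1)/t_{k+1}$. The function names, the `rule` argument, and the restart convention (resetting $t$ to 1 so that the next momentum coefficient is zero) are assumptions made for illustration and are not the authors' implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(A, b, x, rho):
    """Assumed objective: F(x) = 0.5 * ||Ax - b||^2 + rho * ||x||_1."""
    r = A @ x - b
    return 0.5 * (r @ r) + rho * np.abs(x).sum()

def ifbs(A, b, rho, iters, rule="fista-bt", alpha=0.0, step=None):
    """Inertial forward-backward iterations started from the zero vector.

    rule = "constant":   fixed momentum `alpha` (alpha = 0 gives ISTA)
    rule = "fista-bt":   Beck-Teboulle momentum (t - 1) / t_next
    rule = "fista-adre": same, with t reset when the objective increases
    """
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    lam = 1.0 / L if step is None else step
    x_prev = np.zeros(A.shape[1])
    x = np.zeros(A.shape[1])
    t = 1.0
    F_prev = objective(A, b, x, rho)
    history = []
    for _ in range(iters):
        if rule == "constant":
            a = alpha
        else:
            t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            a = (t - 1.0) / t_next
            t = t_next
        y = x + a * (x - x_prev)                             # inertial extrapolation
        grad = A.T @ (A @ y - b)                             # forward (gradient) step
        x_next = soft_threshold(y - lam * grad, lam * rho)   # backward (prox) step
        F_next = objective(A, b, x_next, rho)
        if rule == "fista-adre" and F_next > F_prev:
            t = 1.0                                          # assumed restart convention
        x_prev, x, F_prev = x, x_next, F_next
        history.append(F_next)
    return x, np.array(history)

# Small illustrative instance (sizes, seed, and rho = 0.1 are arbitrary).
rng = np.random.default_rng(0)
A_demo = rng.normal(size=(20, 50)) / np.sqrt(20)
x_true = np.zeros(50)
x_true[:5] = 1.0
b_demo = A_demo @ x_true
F_zero = objective(A_demo, b_demo, np.zeros(50), 0.1)
_, hist_ista = ifbs(A_demo, b_demo, 0.1, 300, rule="constant", alpha=0.0)
_, hist_bt = ifbs(A_demo, b_demo, 0.1, 300, rule="fista-bt")
_, hist_re = ifbs(A_demo, b_demo, 0.1, 300, rule="fista-adre")
```

Passing `rule="constant"` with a momentum computed from an estimate of $l_E$, and switching to it after the first objective increase, would give a variant in the spirit of FISTA-AdOPT; that switch is left out here because it depends on the formula in (5.31).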

Experiment Details. We create a random instance of Problem $\ell_1$-LS, with $A$ of
size $300 \times 2000$ and with entries drawn i.i.d. from $\mathcal{N}(0, 0.01)$. The vector $b$ is $Ax_0$, with
$x_0$ being 50-sparse and having non-zero entries drawn i.i.d. from $\mathcal{N}(0, 1)$. The regularization
parameter $\rho$ is set to 1.
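The instance described above, together with the thresholding rule used earlier in this section to approximate $E$, can be sketched as follows. The sketch assumes that $\mathcal{N}(0, 0.01)$ denotes variance 0.01 (standard deviation 0.1) and that the smooth term is $f(x) = \frac{1}{2}\|Ax - b\|^2$, so that $\nabla f(x) = A^\top(Ax - b)$. The function names, the seed, and the tolerance argument are hypothetical, and a high-accuracy solution `x_star` has to be supplied by an external solver.

```python
import numpy as np

def make_instance(m=300, n=2000, sparsity=50, seed=0):
    """Random sparse least-squares instance with b = A @ x0."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 0.1, size=(m, n))          # variance 0.01 -> std 0.1
    x0 = np.zeros(n)
    support = rng.choice(n, size=sparsity, replace=False)
    x0[support] = rng.normal(0.0, 1.0, size=sparsity)
    return A, A @ x0, x0

def estimate_active_set(A, b, x_star, rho, tol):
    """Indices i with rho - |h_i| < tol, where h = grad f(x_star)."""
    h_star = A.T @ (A @ x_star - b)
    return np.flatnonzero(rho - np.abs(h_star) < tol)

def smallest_eig_on(A, idx):
    """Smallest eigenvalue of A_E^T A_E for the columns listed in idx."""
    A_E = A[:, idx]
    return np.linalg.eigvalsh(A_E.T @ A_E)[0]       # eigvalsh sorts ascending

A, b, x0 = make_instance()
```

Given a solution `x_star` from a solver, `smallest_eig_on(A, estimate_active_set(A, b, x_star, 1.0, tol))` would give an estimate of $l_E$ for the chosen tolerance.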

The results are shown in Fig. 6.1. Both FISTA-AdOPT and FISTA-AdRe inherit
the $O(1/k^2)$ convergence rate of FISTA-BT during the transient period. However, they
also have the optimal asymptotic rate of I-FBS-OPT. FISTA-BT begins to oscillate
once the optimal manifold is identified, and its iteration complexity is worse than
that of I-FBS-OPT, as predicted in Section 5.5. The performances of FISTA-AdOPT and
FISTA-AdRe are similar. FISTA-AdOPT has the advantage that it will not oscillate
like FISTA-AdRe, which has to be continually reset once the momentum exceeds the
optimal value. However, FISTA-AdOPT requires computation of an estimate of the
locally optimal momentum, which depends on the smallest eigenvalue of $A^\top A$ restricted to
the support.



Fig. 6.1. Experiment results: $F(x^k) - F^*$ versus iteration $k$ for several algorithms.

7. Conclusions. In this paper, we applied a Lyapunov analysis to a family of
inertial forward-backward splitting methods for convex composite minimization. We
have proved weak convergence under the following broad conditions with the standard
assumptions on the objective function: for the momentum parameter, $0 < \alpha_k < 1$
for all $k$ and $\limsup \alpha_k < 1$, and for the step-size, non-decreasing and $0 < \lambda_k < 1/L$.
These conditions are more general than the specific sequences studied in m, m and
m and less restrictive than the conditions derived in [28]. We considered in detail
the behavior of I-FBS applied to sparse optimization problems. With the aid of some
results from the Lyapunov analysis we were able to show that I-FBS achieves local
linear convergence for these problems, with finite convergence of certain quantities.
The local linear convergence results also apply to the FISTA-like methods of [T5], [55]
and [27].

An interesting direction of future research is to see if this local linear convergence 
behavior holds for a more general class of problems satisfying certain properties such 
as partial smoothness and local strong convexity, as considered in [30] for the forward- 
backward algorithm. 